The present invention relates to genome-derived
single exon microarrays useful for verifying the expression
of regions of genomic DNA predicted to encode protein. In
particular, the present invention relates to unique genome-
derived single exon nucleic acid probes expressed in human
lung and single exon nucleic acid microarrays that include
such probes.

Background of the Invention 
For almost two decades following the invention of
general techniques for nucleic acid sequencing, Sanger et
al., Proc. Natl. Acad. Sci. USA 70(4):1209-13 (1973);
Gilbert et al., Proc. Natl. Acad. Sci. USA 70(12):3581-4
(1973), these techniques were used principally as tools to
further the understanding of proteins - known or
suspected - about which a basic foundation of biological
knowledge had already been built. In many cases, the
cloning effort that preceded sequence identification had
been both informed and directed by that antecedent
biological understanding.

For example, the cloning of the T cell receptor
for antigen was predicated upon its known or suspected cell
type-specific expression, by its suspected membrane
association, and by the predicted assembly of its gene via
T cell-specific somatic recombination. Subsequent
sequencing efforts at once confirmed and extended
understanding of this family of proteins. Hedrick et al.,
Nature 308(5955):153-8 (1984).

More recently, however, the development of high
throughput sequencing methods and devices, in concert with
large public and private undertakings to sequence the human
and other genomes, has altered this investigational
paradigm: today, sequence information often precedes
understanding of the basic biology of the encoded protein
product.

One of the approaches to large-scale sequencing
is predicated upon the proposition that expressed
sequences - that is, those accessible through isolation of
mRNA - are of greatest initial interest. This "expressed
sequence tag" ("EST") approach has already yielded vast
amounts of sequence data (see for example Adams et al.,
Science 252:1651 (1991); Williamson, Drug Discov. Today
4:115 (1999)). For nucleic acids sequenced by this
approach, often the only biological information that is
known a priori with any certainty is the likelihood of
biologic expression itself. By virtue of the species and
tissue from which the mRNA had originally been obtained,
most such sequences are also annotated with the identity of
the species and at least one tissue in which expression
appears likely.

More recently, the pace of genomic sequencing has
accelerated dramatically. When genomic DNA serves as the
initial substrate for sequencing efforts, expression cannot
be presumed; often the only a priori biological information
about the sequence includes the species and chromosome (and
perhaps chromosomal map location) of origin.

With the ever-accelerating pace of sequence
accumulation by directed, EST, and genomic sequencing
approaches - and in particular, with the accumulation of
sequence information from multiple genera, from multiple
species within genera, and from multiple individuals within
a species - there is an increasing need for methods that
rapidly and effectively permit the functions of nucleic
acid sequences to be elucidated. And as such functional
information accumulates, there is a further need for
methods of storing such functional information in
meaningful and useful relationship to the sequence itself;
that is, there is an increasing need for means and
apparatus for annotating raw sequence data with known or
predicted functional information.

Although the increase in the pace of genomic
sequencing is due in large part to technological changes in
sequencing strategies and instrumentation, Service, Science
280:995 (1998); Pennisi, Science 283:1822-1823 (1999),
there is an important functional motivation as well.

While it was understood that the EST approach
would rarely be able to yield sequence information about
the noncoding portions of the genome, it now also appears
that the EST approach is capable of capturing only a
fraction of a genome's actual expression complexity.

For example, when the C. elegans genome was fully
sequenced, gene prediction algorithms identified over
19,000 potential genes, of which only 7,000 had been found
by EST sequencing. C. elegans Sequencing Consortium,
Science 282:2012 (1998). Analogously, the recently
completed sequence of chromosome 2 of Arabidopsis predicts
over 4000 genes, Lin et al., Nature 402:761 (1999), of
which only about 6% had previously been identified via
sequencing efforts. Although the human genome has the
greatest depth of EST coverage, it is still woefully short
of surrendering all of its genes. One recent estimate
suggests that the human genome contains more than 146,000
genes, which would at this point leave greater than half of
the genes undiscovered. It is now predicted that many
genes, perhaps 20 to 50%, will only be found by genomic
sequencing.

There is, therefore, a need for methods that
permit the functional regions of genomic sequence - and
most importantly, but not exclusively, regions that
function to encode genes - to be identified.

Much of the coding sequence of the human genome
is not homologous to known genes, making detection of open
reading frames ("ORFs") and predictions of gene function
difficult. Computational methods exist for predicting
coding regions in eukaryotic genomes. Gene prediction
programs such as GRAIL and GRAIL II, Uberbacher et al.,
Proc. Natl. Acad. Sci. USA 88(24):11261-5 (1991); Xu et
al., Genet. Eng. 16:241-53 (1994); Uberbacher et al.,
Methods Enzymol. 266:259-81 (1996); GENEFINDER, Solovyev et
al., Nucl. Acids Res. 22:5156-63 (1994); Solovyev et al.,
Ismb 5:294-302 (1997); and GENSCAN, Burge et al., J. Mol.
Biol. 268:78-94 (1997), predict many putative genes without
known homology or function. Such programs are known,
however, to give high false positive rates. Burset et al.,
Genomics 34:353-367 (1996). Using a consensus obtained by
a plurality of such programs is known to increase the
reliability of calling exons from genomic sequence.
Ansari-Lari et al., Genome Res. 8(1):29-40 (1998).
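
The consensus idea described above can be illustrated with a minimal Python sketch. It is not the method of any of the cited programs: the program names, the half-open (start, end) coordinate convention, and the `min_programs` voting threshold are all assumptions made for illustration. The sketch reports the genomic intervals on which at least a chosen number of independent predictors agree.

```python
from collections import Counter


def merge(intervals):
    """Merge overlapping (start, end) intervals from a single program."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def consensus_exons(predictions, min_programs=2):
    """Return intervals covered by at least `min_programs` predictors.

    `predictions` maps a (hypothetical) program name to a list of
    half-open (start, end) intervals in one shared coordinate system.
    """
    deltas = Counter()
    for intervals in predictions.values():
        # Merge first so one program cannot vote twice on the same base.
        for start, end in merge(intervals):
            deltas[start] += 1
            deltas[end] -= 1

    consensus, depth, open_at = [], 0, None
    for pos in sorted(deltas):
        depth += deltas[pos]
        if depth >= min_programs and open_at is None:
            open_at = pos                      # agreement begins here
        elif depth < min_programs and open_at is not None:
            consensus.append((open_at, pos))   # agreement ends here
            open_at = None
    return consensus


# Illustrative, made-up predictions from three hypothetical programs.
example = {
    "program_a": [(100, 200), (500, 600)],
    "program_b": [(150, 250)],
    "program_c": [(120, 180), (590, 700)],
}
```

Raising `min_programs` trades sensitivity for specificity: fewer regions survive, but each is supported by more programs.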

Identification of functional genes from genomic
data remains, however, an imperfect art. For example, in
reporting the full sequence of human chromosome 21, the
Chromosome 21 Mapping and Sequencing Consortium reports
that prior bioinformatic estimates of human gene number may
need to be revised substantially downwards. Nature
405:311-319 (2000); Reeves, Nature 405:283-284 (2000).

Thus, there is a need for methods and apparatus 
that permit the functions of the regions identified 
bioinformatically - and specifically, that permit the 
expression of regions predicted to encode protein - readily 
to be confirmed experimentally. 

Recently, the development of nucleic acid
microarrays has made possible the automated and highly
parallel measurement of gene expression. Reviewed in
Schena (ed.), DNA Microarrays: A Practical Approach
(Practical Approach Series), Oxford University Press (1999)
(ISBN: 0199637768); Nature Genet. 21(1)(suppl):1 - 60
(1999); Schena (ed.), Microarray Biochip: Tools and
Technology, Eaton Publishing Company/BioTechniques Books
Division (2000) (ISBN: 1881299376).

It is common for microarrays to be derived from
cDNA/EST libraries, either from those previously described
in the literature, such as those from the I.M.A.G.E.
consortium, Lennon et al., Genomics 33(1):151-2 (1996), or
from the construction of "problem specific" libraries
targeted at a particular biological question, R.S. Thomas
et al., Cancer Res. (in press). Such microarrays by
definition can measure expression only of those genes found
in EST libraries, and thus have not been useful as probes
for genes discovered solely by genomic sequencing.

The utility of using whole genome nucleic acid
microarrays to answer certain biological questions has been
demonstrated for the yeast Saccharomyces cerevisiae.
DeRisi et al., Science 278:680 (1997). The vast majority of
yeast nuclear genes, approximately 95%, however, are single
exon genes, i.e., lack introns, Lopez et al., RNA 5:1135-
1137 (1999); Goffeau et al., Science 274:563-67 (1996),
permitting coding regions more readily to be identified.
Whole genome nucleic acid microarrays have not generally
been used to probe gene expression from more complex
eukaryotic genomes, and in particular from those averaging
more than one intron per gene.

Diseases of the lung are a significant cause of
human morbidity and mortality. Increasingly, genetic
factors are being found that contribute to predisposition,
onset, and/or aggressiveness of most, if not all, of these
diseases; although causative mutations in single genes have
been identified for some, these disorders are, for the most
part, believed to have polygenic etiologies. There is a
need for methods and apparatus that permit prediction,
diagnosis and prognosis of diseases of the human lung,
particularly those diseases with polygenic etiology.

Summary of the Invention
The present invention solves these and other
problems in the art by providing methods and apparatus for
predicting, confirming, and displaying functional
information derived from genomic sequence. The present
invention also provides apparatus for verifying the
expression of putative genes identified within genomic
sequence.

In particular, the invention provides novel
genome-derived single exon nucleic acid microarrays useful
for verifying the expression of putative genes identified
within genomic sequence.

The present invention also provides compositions
and kits for the ready production of nucleic acids
identical in sequence to, or substantially identical in
sequence to, probes on the genome-derived single exon
microarrays of the present invention.

Accordingly, in a first aspect of the invention,
there is provided a spatially-addressable set of single
exon nucleic acid probes for measuring gene expression in a
sample derived from human lung, comprising a plurality of
single exon nucleic acid probes according to any one of the
nucleotide sequences set out in SEQ ID NOs: 1 - 12,614 or a
complementary sequence, or a portion of such a sequence.

By plurality is meant at least two, suitably at
least 20, most suitably at least 100, preferably at least
1000 and, most preferably, up to 5000.

In one embodiment of the first aspect, each of
said plurality of probes is separately and addressably
amplifiable.

In an alternative embodiment, each of said
plurality of probes is separately and addressably
isolatable from said plurality.

In a preferred embodiment, each of said plurality
of probes is amplifiable using at least one common primer.
Preferably, each of said plurality of probes is amplifiable
using a first and a second common primer.

In yet another embodiment, said set of single
exon nucleic acid probes comprises between 50 - 20,000
probes, for example, 50 - 5000.

Suitably, said set of single exon nucleic acid
probes comprises at least 50 - 1000 discrete single exon
nucleic acid probes having a sequence as set out in any of
SEQ ID NOs.: 1 - 25,001 or a complementary sequence, or a
portion of such a sequence.

Preferably, the average length of the single exon
nucleic acid probes is between 200 and 500 bp. It is
preferred that the average length should be at least 200bp,
suitably at least 250bp, most suitably at least [...] bp,
preferably at least 400bp and, most preferably, 500bp.

In another embodiment, the single exon nucleic
acid probes lack prokaryotic and bacteriophage vector
sequence. It is preferred that at least 50%, suitably at
least 60%, most suitably at least 70%, preferably at least
75%, more preferably at least 80, 85, 90, 95 or 99% of said
single exon nucleic acid probes lack prokaryotic and
bacteriophage vector sequence.

In another preferred embodiment, said single exon
nucleic acid probes lack homopolymeric stretches of A or T.
It is preferred that at least 50%, suitably at least 60%,
most suitably at least 70%, preferably at least 75%, more
preferably at least 80, 85, 90, 95 or 99% of said single
exon nucleic acid probes lack homopolymeric stretches of A
or T.

Preferably, a spatially-addressable set of single
exon nucleic acid probes in accordance with the first
aspect of the invention is addressably disposed upon a
substrate.

Suitable substrates include a filter membrane
which may, preferably, be nitrocellulose or nylon. The
nylon may, preferably, be positively-charged. Other suitable
substrates include glass, amorphous silicon, crystalline
silicon, and plastic. Further suitable materials include
polymethylacrylic, polyethylene, polypropylene,
polyacrylate, polymethylmethacrylate, polyvinylchloride,
polytetrafluoroethylene, polystyrene, polycarbonate,
polyacetal, polysulfone, celluloseacetate,
cellulosenitrate, nitrocellulose, and mixtures thereof.

In a second aspect of the invention, there is
provided a microarray comprising a spatially addressable
set of single exon nucleic acid probes in accordance with
the first aspect of the invention.

In one embodiment, a genome-derived single-exon
microarray is packaged together with such an ordered set of
amplifiable probes corresponding to the probes, or one or
more subsets of probes, thereon. In alternative
embodiments, the ordered set of amplifiable probes is
packaged separately from the genome-derived single exon
microarray.

In another aspect, the invention provides genome-
derived single exon nucleic acid probes useful for gene
expression analysis, and particularly for gene expression
analysis by microarray. In particular embodiments of this
aspect, the present invention provides human single-exon
probes that include specifically-hybridizable fragments of
SEQ ID Nos. 12,615 - 25,001, wherein the fragment
hybridizes at high stringency to an expressed human gene.
In particular embodiments, the invention provides single
exon probes comprising SEQ ID Nos. 1 - 12,614.

Accordingly, in a third aspect of the invention,
there is provided a single exon nucleic acid probe for
measuring human gene expression in a sample derived from
human lung which is a nucleic acid molecule comprising a
nucleotide sequence as set out in any of SEQ ID NOs.: 1 -
12,614 or a complementary sequence or a fragment thereof
wherein said probe hybridizes at high stringency to a
nucleic acid expressed in the human lung.

In one embodiment, a single exon nucleic acid
probe in accordance with the third aspect comprises a
nucleotide sequence as set out in any of SEQ ID NOs.:
12,615 - 25,001 or a complementary sequence or a fragment
thereof.
In a fourth aspect of the invention, there is
provided a single exon nucleic acid probe for measuring
human gene expression in a sample derived from human lung
which is a nucleic acid molecule having a sequence encoding
a peptide comprising a peptide sequence as set out in any
of SEQ ID NOs.: 25,002 - 37,012 or a complementary sequence
or a fragment thereof wherein said probe hybridizes at high
stringency to a nucleic acid expressed in the human lung.

Preferably, a single exon nucleic acid probe in
accordance with the third or fourth aspects of the
invention comprises between at least 15 and 50 contiguous
nucleotides of said SEQ ID NO:. It is preferred that the
single exon nucleic acid probe comprises at least 15,
suitably at least 20, more suitably at least 25 or
preferably at least 50 contiguous nucleotides of said SEQ
ID NO:.

In another preferred embodiment, a single exon
nucleic acid probe in accordance with the third or fourth
aspects of the invention is between 3kb and 25kb in length.
It is preferred that said probe is no more than 3kb,
suitably no more than 5kb, more suitably no more than 10kb,
preferably 15kb, more preferably 20kb or, most preferably,
no more than 20kb in length.

Preferably, a single exon nucleic acid probe in
accordance with either the fifth or sixth aspect of the
invention is DNA, preferably single-stranded DNA, RNA or
PNA.

In another embodiment of either the third or
fourth aspect of the invention, a single exon nucleic acid
probe is detectably labeled. Suitable detectable labels
include a radionuclide, a fluorescent label or a first
member of a specific binding pair. Suitable fluorescent
labels include dyes such as cyanine dyes, preferably Cy3
and Cy5 although other suitable dyes will be known to those
skilled in the art.

In a particularly preferred embodiment, a single
exon nucleic acid probe in accordance with either the third
or fourth aspect of the invention lacks prokaryotic and
bacteriophage vector sequence. In yet another embodiment, a
single exon nucleic acid probe in accordance with either
the third or fourth aspect of the invention lacks
homopolymeric stretches of A or T.
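
The homopolymer criterion can be checked mechanically. The following Python sketch is illustrative only: the text does not state how long a run of A or T must be to count as a "homopolymeric stretch", so the `min_run` default of 20 is an assumption, as are the function names.

```python
import re


def has_homopolymer(seq, bases="AT", min_run=20):
    """True if `seq` contains a run of at least `min_run` identical bases.

    Only the bases listed in `bases` are considered. `min_run` is an
    assumed cut-off; the source text gives no numeric definition.
    """
    pattern = "|".join(f"{re.escape(b)}{{{min_run},}}" for b in bases)
    return re.search(pattern, seq.upper()) is not None


def fraction_lacking_homopolymer(probes, min_run=20):
    """Fraction of probe sequences that lack A or T homopolymer runs."""
    if not probes:
        return 0.0
    clean = sum(1 for p in probes if not has_homopolymer(p, min_run=min_run))
    return clean / len(probes)


# Made-up probe sequences for illustration.
probe_set = ["ACGT" * 10, "T" * 25, "GATTACA", "C" * 30]
```

The returned fraction can be compared directly against the percentage thresholds recited for the probe set (for example, at least 50% or at least 75%).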

In a fifth aspect of the invention, there is
provided an amplifiable nucleic acid composition,
comprising:

the single exon nucleic acid probe in accordance
with either of the third or fourth aspects of the
invention; and at least one nucleic acid primer;

wherein said at least one primer is sufficient to
prime enzymatic amplification of said probe.

In a sixth aspect of the invention, there is
provided a method of measuring gene expression in a sample
derived from human lung, comprising:

contacting the single exon microarray in
accordance with the second aspect of the invention, with a
first collection of detectably labeled nucleic acids, said
first collection of nucleic acids derived from mRNA of
human lung; and then

measuring the label detectably bound to each
probe of said microarray.

In a seventh aspect of the invention, there is
provided a method of identifying exons in a eukaryotic
genome, comprising:

algorithmically predicting at least one exon from
genomic sequence of said eukaryote; and then

detecting specific hybridization of detectably
labeled nucleic acids to a single exon probe,

wherein said detectably labeled nucleic acids are
derived from mRNA from the lung of said eukaryote, said
probe is a single exon probe having a fragment identical in
sequence to, or complementary in sequence to, said
predicted exon, said probe is included within a single exon
microarray in accordance with the first aspect of the
invention, and said fragment is selectively hybridizable at
high stringency.

In an eighth aspect of the invention, there is
provided a method of assigning exons to a single gene,
comprising:

identifying a plurality of exons from genomic
sequence in accordance with the seventh aspect of the
invention; and then

measuring the expression of each of said exons in
a plurality of tissues and/or cell types using
hybridization to single exon microarrays having a probe
with said exon,

wherein a common pattern of expression of said
exons in said plurality of tissues and/or cell types
indicates that the exons should be assigned to a single
gene.
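
One way to make "a common pattern of expression" concrete is to group exons whose expression profiles across tissues are strongly correlated. The Python sketch below is only an illustration under stated assumptions: the text does not prescribe a similarity measure, so the use of Pearson correlation, the `min_r` threshold of 0.9, and the exon names and values are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a flat profile carries no pattern to compare
    return cov / sqrt(vx * vy)


def group_exons(expression, min_r=0.9):
    """Group exons whose profiles correlate at or above `min_r`.

    `expression` maps an exon name to its measurements, one per tissue,
    in the same tissue order for every exon. Groups are formed by
    single linkage using a small union-find structure.
    """
    names = list(expression)
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path compression
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if pearson(expression[a], expression[b]) >= min_r:
                parent[find(b)] = find(a)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(groups.values())


# Made-up measurements in four tissues for three hypothetical exons.
profiles = {
    "exon1": [1, 5, 2, 8],
    "exon2": [2, 10, 4, 16],
    "exon3": [9, 1, 7, 2],
}
```

In practice such grouping would presumably also take genomic proximity and strand into account; the sketch considers expression alone.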

In a ninth aspect of the invention, there is
provided a nucleic acid sequence as set out in any of SEQ
ID NOs: 1 - 25,001 wherein said sequence encodes a peptide.

In a tenth aspect of the invention, there is
provided a peptide encoded by a sequence comprising a
sequence as set out in any of SEQ ID NOs: 12,615 - 25,001,
or a complementary sequence or coding portion thereof.

In a preferred embodiment, a peptide may be
encoded by a sequence comprising a sequence set out in any
of SEQ ID NOS.: 1 - 12,614.

In a further aspect, the invention provides
peptides comprising an amino acid sequence translated from
the DNA fragments, said amino acid sequences comprising SEQ
ID NOS.: 25,002 - 37,012.

Accordingly, in an eleventh aspect of the invention
there is provided a peptide comprising a sequence as set
out in any of SEQ ID NOs: 25,002 - 37,012, or fragment
thereof.

In another aspect, the invention provides means
for displaying annotated sequence, and in particular for
displaying sequence annotated according to the methods and
apparatus of the present invention. Further, such display
can be used as a preferred graphical user interface for
electronic search, query, and analysis of such annotated
sequence.
Definitions
As used herein, the term "microarray" and phrase
"nucleic acid microarray" refer to a substrate-bound
collection of plural nucleic acids, hybridization to each
of the plurality of bound nucleic acids being separately
detectable. The substrate can be solid or porous, planar
or non-planar, unitary or distributed.

As so defined, the term "microarray" and phrase
"nucleic acid microarray" include all the devices so called
in Schena (ed.), DNA Microarrays: A Practical Approach
(Practical Approach Series), Oxford University Press (1999)
(ISBN: 0199637768); Nature Genet. 21(1)(suppl):1 - 60
(1999); and Schena (ed.), Microarray Biochip: Tools and
Technology, Eaton Publishing Company/BioTechniques Books
Division (2000) (ISBN: 1881299376). As so defined, the
term "microarray" and phrase "nucleic acid microarray"
further include substrate-bound collections of plural
nucleic acids in which the nucleic acids are distributably
disposed on a plurality of beads, rather than on a unitary
planar substrate, as is described, inter alia, in Brenner
et al., Proc. Natl. Acad. Sci. USA 97(4):1665-1670 (2000);
in such case, the term "microarray" and phrase "nucleic
acid microarray" refer to the plurality of beads in
aggregate.

As used herein with respect to a nucleic acid
microarray, the term "probe" refers to the nucleic acid
that is, or is intended to be, bound to the substrate; in
such context, the term "target" thus refers to nucleic acid
intended to be bound thereto by Watson-Crick
complementarity. As used herein with respect to solution
phase hybridization, the term "probe" refers to the nucleic
acid of known sequence that is detectably labeled.

As used herein, the expression "probe comprising
SEQ ID NO.", and variants thereof, intends a nucleic acid
probe, at least a portion of which probe has either (i) the
sequence directly as given in the referenced SEQ ID NO., or
(ii) a sequence complementary to the sequence as given in
the referenced SEQ ID NO., the choice as between sequence
directly as given and complement thereof dictated by the
requirement that the probe hybridize to mRNA.
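
A short Python sketch may help make the two alternatives in this definition concrete. It assumes, for illustration, that "a sequence complementary to" means the reverse complement written 5' to 3', and that sequences use only the letters A, C, G and T; the function names are hypothetical.

```python
# Translation table for Watson-Crick base pairing (DNA alphabet only).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def comprises(probe, reference):
    """True if `probe` contains `reference` or its reverse complement.

    Mirrors alternatives (i) and (ii) of the definition: the probe may
    carry the referenced sequence as given, or its complement.
    """
    probe, reference = probe.upper(), reference.upper()
    return reference in probe or reverse_complement(reference) in probe
```

For example, a probe containing `GCAT` satisfies the test for the reference `ATGC`, because `GCAT` is the reverse complement of `ATGC`.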

As used herein, the term "open reading frame" and
the equivalent acronym "ORF" refer to that portion of an
exon that can be translated in its entirety into a sequence
of contiguous amino acids, i.e. a nucleic acid sequence
that, in at least one reading frame, does not possess stop
codons; the term does not require that the ORF encode the
entirety of a natural protein.
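
The ORF definition above translates directly into a test on reading frames. The Python sketch below is a minimal illustration, assuming the standard stop codons TAA, TAG and TGA and examining only the three frames of the strand as given; a fuller check would presumably repeat the test on the complementary strand.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code, DNA alphabet


def open_frames(seq):
    """Return the frame offsets (0, 1, 2) that contain no stop codon.

    Only complete codons are examined and only the given strand is
    read. A non-empty result means the sequence meets the definition
    of an ORF in at least one forward reading frame.
    """
    seq = seq.upper()
    frames = []
    for offset in range(3):
        codons = (seq[i:i + 3] for i in range(offset, len(seq) - 2, 3))
        if not any(codon in STOP_CODONS for codon in codons):
            frames.append(offset)
    return frames
```

Note that, consistent with the definition, the test does not look for a start codon and does not require the frame to span a complete protein.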

As used herein, the term "amplicon" refers to a
PCR product amplified from human genomic DNA, containing
the predicted exon.

As used herein, the term "exon" refers to the
consensus prediction of the various exon and gene
predicting algorithms, i.e. a nucleic acid sequence
bioinformatically predicted to encode a portion of a
natural protein.

As used herein, the term "peptide" refers to a
sequence of amino acids. The sequences referred to as
PEPTIDE SEQ ID NOS.: are the predicted peptide sequences
that would be translated from one of the exons, or a
portion thereof set out in exon SEQ ID NOS.:. The codons
encoding the peptide are wholly contained within the exon.

As used herein, "portions" of a defined
nucleotide sequence or sequences can be and, preferably,
are fragments unique to that sequence or to one or a
combination of those sequences. A fragment unique to a
nucleic acid molecule is one that is a signature for the
larger nucleic acid molecule.

As used herein, the phrase "expression of a
probe" and its linguistic variants means that the ORF
present within the probe, or its complement, is present
within a target mRNA.

As used herein, "stringent conditions" refers to
parameters well known to those skilled in the art. When a
nucleic acid molecule is said to be hybridisable to another
of a given sequence under "stringent conditions" it is
meant that it is homologous to the given sequence.

As used herein, the phrase "specific binding
pair" intends a pair of molecules that bind to one another
with high specificity. Binding pairs are said to exhibit
specific binding when they exhibit avidity of at least 10^7,
preferably at least 10^8, more preferably at least 10^9
liters/mole. Nonlimiting examples of specific binding
pairs are: antibody and antigen; biotin and avidin; and
biotin and streptavidin.

As used herein with respect to the visual display
of annotated genomic sequence, the term "rectangle" means
any geometric shape that has at least a first and a second
border, wherein the first and second borders each are
capable of mapping uniquely to a point of another visual
object of the display.

As used herein, a "Mondrian" means a visual
display in which a single genomic sequence is annotated
with predicted and experimentally confirmed functional
information.



Brief Description of the Drawings

The present invention is further illustrated with 
reference to the following non-limiting figures and 
examples in which: 

FIG. 1 illustrates a process for predicting
functional regions from genomic sequence, confirming the
functional activity of such regions experimentally, and
associating and displaying the data so obtained in
meaningful and useful relationship to the original sequence
data;

FIG. 2 further elaborates that portion of the
process schematized in FIG. 1 for predicting functional
regions from genomic sequence;

FIG. 3 illustrates a Mondrian visual display;

FIG. 4 presents a Mondrian showing a hypothetical
annotated genomic sequence;

FIG. 5 is a histogram showing the distribution of
ORF length and PCR products as obtained, with ORF length
shown in black and PCR product length shown in dotted
lines;

FIG. 6 is a histogram showing the distribution,
among exons predicted according to the methods described,
of expression as measured using simultaneous two color
hybridization to a genome-derived single exon microarray.
The graph shows the number of sequence-verified products
that were either not expressed ("0"), expressed in one or
more but not all tested tissues ("1" - "9"), or expressed
in all tissues tested ("10");

FIG. 7 is a pictorial representation of the
expression of verified sequences that showed expression
with signal intensity greater than 3 in at least one
tissue, with: FIG. 7A showing the expression as measured by
microarray hybridization in each of the 10 measured
tissues, and the expression as measured "bioinformatically"
by query of EST, NR and SwissProt databases; with FIG. 7B
showing the legend for display of physical expression
(ratio) in FIG. 7A; and with FIG. 7C showing the legend for
scoring EST hits as depicted in FIG. 7A;

FIG. 8 shows a comparison of normalized Cy3
signal intensity for arrayed sequences that were identical
to sequences in existing EST, NR and SwissProt databases or
that were dissimilar (unknown), where black denotes the
signal intensity for all sequence-verified products with a
BLAST Expect ("E") value of greater than 1e-30 (1 x 10^-30)
("unknown") and a dotted line denotes sequence-verified
spots with a BLAST Expect ("E") value of less than 1e-30 (1
x 10^-30) ("known");

FIG. 9 presents a Mondrian of BAC AC008172 (bases
25,000 to 130,000), containing the carbamyl phosphate
synthetase gene (AF154830.1); and

FIG. 10 is a Mondrian of BAC A049839.
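
The "known" versus "unknown" split used for FIG. 8 amounts to a threshold on the BLAST Expect value. The Python sketch below illustrates that rule only; the probe identifiers and E values are made up, and treating a value exactly equal to the threshold as "unknown" is an assumption, since the text addresses only values greater than or less than 1e-30.

```python
def classify_by_expect(hits, threshold=1e-30):
    """Label each probe "known" or "unknown" from its best BLAST E value.

    `hits` maps a probe identifier to the lowest Expect value found
    for it. Values below `threshold` are "known"; all others,
    including values exactly equal to it (an assumption), are
    "unknown".
    """
    return {
        probe_id: "known" if e_value < threshold else "unknown"
        for probe_id, e_value in hits.items()
    }


# Hypothetical probe identifiers and E values for illustration.
example_hits = {"probe_1": 1e-80, "probe_2": 0.5, "probe_3": 1e-30}
```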



Methods and Apparatus for Predicting, Confirming,
Annotating, and Displaying Functional Regions From Genomic
Sequence Data

FIG. 1 is a flow chart illustrating in broad
outline a process for predicting functional regions from
genomic sequence, confirming and characterizing the
functional activity of such regions experimentally, and
then associating and displaying the information so obtained
in meaningful and useful relationship to the original
sequence data.

WO 01/86003 PCT/US01/00665

The initial input into process 10 of the present
invention is drawn from one or more databases 100
containing genomic sequence data. Because genomic sequence
is usually obtained from subgenomic fragments, the sequence
data typically will be stored in a series of records
corresponding to these subgenomic sequenced fragments.
Some fragments will have been catenated to form larger
contiguous sequences ("contigs"); others will not. A
finite percentage of sequence data in the database will
typically be erroneous, consisting inter alia of vector
sequence, sequence created from aberrant cloning events,
sequence of artificial polylinkers, and sequence that was
erroneously read.

Each sequence record in database 100 will
minimally contain as annotation a unique sequence
identifier (accession number), and will typically be
annotated further to identify the date of accession,
species of origin, and depositor. Because database 100 can
contain nongenomic sequence, each sequence will typically
be annotated further to permit query for genomic sequence.
Chromosomal origin, optionally with map location, can also
be present. Data can be, and over time increasingly will
be, further annotated with additional information, in part
through use of the present invention, as described below.
Annotation can be present within the data records, in
information external to database 100 and linked to the
records thereto, or through a combination of the two.

Databases useful as genomic sequence database 100
in the present invention include GenBank, and particularly
include several divisions thereof, including the
htgs (draft), NT (nucleotide, command line), and NR
(nonredundant) divisions. GenBank is produced by the
National Institutes of Health and is maintained by the
National Center for Biotechnology Information (NCBI).
Databases of genomic sequence from species other than
human, such as mouse, rat, Arabidopsis, C. elegans, C.
briggsae, Drosophila, zebra fish, and other higher
eukaryotic organisms will also prove useful as genomic
sequence database 100.

Genomic sequence obtained by query of genomic
sequence database 100 is then input into one or more
processes 200 for identification of regions therein that
are predicted to have a biological function as specified by
the user. Such functions include, but are not limited to,
encoding protein, regulating transcription, regulating
message transport after transcription into mRNA, regulating
message splicing after transcription into mRNA,
regulating message degradation after transcription into
mRNA, and the like. Other functions include directing
somatic recombination events, contributing to chromosomal
stability or movement, contributing to allelic exclusion or
X chromosome inactivation, and the like.

The particular genomic sequence to be input into
process 200 will depend upon the function for which
relevant sequence is to be identified as well as upon the
approach chosen for such identification. Process 200
can be iterated to identify different functions within a
given genomic region. In such case, the input often will
be different for the several iterations.

Sequences predicted to have the requisite
function by process 200 are then input into process 300,
where a subset of the input sequences suitable for
experimental confirmation is identified. Experimental
confirmation can involve physical and/or bioinformatic
assay. Where the subsequent experimental assay is
bioinformatic, rather than physical, there are fewer
constraints on the sequences that can be tested, and in
this latter case therefore process 300 can output the
entirety of the input sequence.

The subset of sequences output from process 300 
is then used in process 400 for experimental verification
and characterization of the function predicted in
process 200, which experimental verification can, and often
will, include both physical and bioinformatic assay.

Process 500 annotates the sequence data with the 
functional information obtained in the physical and/or
bioinformatic assays of process 400. Such annotation can
be done using any technique that usefully relates the
functional information to the sequence, as, for example, by
incorporating the functional data into the sequence data
record itself, by linking records in a hierarchical or
relational database, by linking to external databases, by a
combination thereof, or by other means well known within
the database arts. The data can even be submitted for
incorporation into databases maintained by others, such as
GenBank, which is maintained by NCBI.

As further noted in FIG. 1, additional annotation 
can be input into process 500 from external sources 600. 

The annotated data is then displayed in process 
800, either before, concomitantly with, or after optional 
storage 700 on nontransient media, such as magnetic disk,
optical disc, magnetooptical disk, flash memory, or the 
like. 

FIG. 1 shows that the experimental data output 
from process 400 can be used in each preceding step of
process 10: e.g., facilitating identification of functional
sequences in process 200, facilitating identification of an
experimentally suitable subset thereof in process 300, and
facilitating creation of physical and/or informational
substrates for, and performance of subsequent assay of,
functional sequences in process 400.

Information from each step can be passed directly 
to the succeeding process, or stored in permanent or 
interim form prior to passage to the succeeding process. 
Often, data will be stored after each, or at least a
plurality, of such process steps. Any or all process steps
can be automated.

FIG. 2 further elaborates the prediction of 
functional sequence within genomic sequence according to 
process 200. 

Genomic sequence database 100 is first queried 20
for genomic sequence.

The sequence required to be returned by query 20 
will depend, in the first instance, upon the function to be 
identified. 

For example, genomic sequences that function to
encode protein can be identified inter alia using gene
prediction approaches, comparative sequence analysis
approaches, or combinations of the two. In gene prediction
analysis, sequence from one genome is input into process
200 where at least one, preferably a plurality, of
algorithmic methods are applied to identify putative coding
regions. In comparative sequence analysis, by contrast,
corresponding, e.g., syntenic, sequence from a plurality of
sources, typically a plurality of species, is input into
process 200, where at least one, possibly a plurality, of
algorithmic methods are applied to compare the sequences 
and identify regions of least variability. 

The exact content of query 20 will also depend 
upon the database queried. For example, if the database
contains both genomic and nongenomic sequence, perhaps
derived from multiple species, and the function to be
determined is protein coding regions in human genomic
sequence, the query will accordingly require that the
sequence returned be genomic and derived from humans.

Query 20 can also incorporate criteria that
compel return of sequence that meets operative requirements
of the subsequent analytical method. Alternatively, or in 
addition, such operative criteria can be enforced in 
subsequent preprocess step 24. 

For example, if the function sought to be
identified is protein coding, query 20 can incorporate
criteria that return from genomic sequence database 100
only those sequences present within contigs sufficiently
long as to have obviated substantial fragmentation of any
given exon among a plurality of separate sequence
fragments.

Such criteria can, for example, consist of a 
required minimal individual genomic sequence fragment 
length, such as 10 kb, more typically 20 kb, 30 kb, 40 kb,
and preferably 50 kb or more, as well as an optional
further or alternative requirement that sequence from any
given clone, such as a bacterial artificial chromosome
("BAC"), be presented in no more than a finite maximal
number of fragments, such as no more than 20 separate
pieces, more typically no more than 15 fragments, even more
typically no more than about 10 - 12 fragments.

Results using the present invention have shown 
that genomic sequence from bacterial artificial chromosomes 
(BACs) is sufficient for gene prediction analysis according 
to the present invention if the sequence is at least 50 kb
in length, and if additionally the sequence from any given 
BAC is presented in fewer than 15, and preferably fewer 
than 10, fragments. Accordingly, query 20 can incorporate 
a requirement that data accessioned from BAC sequencing be 
in fewer than 15, preferably fewer than 10, fragments.
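
The length and fragmentation criteria above can be sketched in a few lines of Python. This is an illustration only, not the query language of any actual sequence database: the `Fragment` record, its field names, and the function `select_fragments` are hypothetical. Only the default thresholds (at least 50 kb, fewer than 15 pieces per BAC) are taken from the text.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """Hypothetical record for one sequenced fragment."""
    accession: str    # unique sequence identifier
    clone_id: str     # clone (e.g. BAC) the fragment was sequenced from
    length_bp: int
    is_genomic: bool
    species: str


def select_fragments(fragments, min_length_bp=50_000,
                     max_pieces_per_clone=15, species="human"):
    """Keep genomic fragments that are long enough and whose source
    clone is presented in fewer than max_pieces_per_clone pieces."""
    # Count how many pieces each clone is presented in.
    pieces = Counter(f.clone_id for f in fragments)
    return [
        f for f in fragments
        if f.is_genomic
        and f.species == species
        and f.length_bp >= min_length_bp
        and pieces[f.clone_id] < max_pieces_per_clone
    ]
```

Passing `max_pieces_per_clone=10` applies the stricter, preferred limit.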

An additional criterion that can be incorporated 
into the query can be the date, or range of dates, of 
sequence accession. Although the process has been 
described above as if genomic sequence database 100 were 
static, it is of course understood that the genomic
sequence databases need not be static, and indeed are 
typically updated on a frequent, even hourly, basis. Thus, 
as further described in Examples 1 and 2, infra, it is 
possible to query the database for newly added sequence, 
either newly added after an absolute date, or newly added
relative to a prior analysis performed using the methods
and apparatus of the present invention. In this way, the
process herein described can incorporate a dynamic,
temporal component.
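
A date-limited query of this kind might be sketched as follows. The function name and the mapping from accession number to accession date are assumptions made for illustration. The cutoff can be an absolute date or the date of a prior analysis.

```python
from datetime import date


def newly_accessioned(records, since):
    """Return accession numbers of records accessioned after `since`.

    `records` maps accession number -> accession date (hypothetical layout).
    """
    return sorted(acc for acc, when in records.items() if when > since)
```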

One utility of such temporal limitation is to
identify, from newly accessioned genomic sequence, the
presence of novel genes, particularly those not previously
identified by EST sequencing (or other sequencing efforts
that are similarly based upon gene expression). As further
described in Example 1, such an approach has shown that
newly accessioned human genomic sequence, when analyzed for
sequences that function to encode protein, readily
identifies genes that are novel over those in existing EST
and other expression databases. This makes the methods of
the present invention extremely powerful gene discovery
tools. And as would be appreciated, such gene discovery 
can be performed using genomic sequence from species other 
than human. 

If query 20 incorporates multiple criteria, such 
as above-described, the multiple criteria can be performed
as a series of separate queries or as a single query, 
depending in part upon the query language, the complexity 
of the query, and other considerations well known in the 
database arts. 

If query 20 returns no genomic sequence meeting
the query criteria, the negative result can be reported by
process 22, and process 200 (and indeed, entire process 10)
ended 23, as shown. Alternatively, or in addition to
report and termination of the initial inquiry, a new query
20 can be generated that takes into account the initial
negative result. 

When query 20 returns sequence meeting the query
criteria, the returned sequence is then passed to optional
preprocessing 24, suitable and specific for the desired
analytical approach and the particular analytical methods
thereof to be used in process 25.

Preprocessing 24 can include processes suitable 
for many approaches and methods thereof, as well as 
processes specifically suited for the intended subsequent 
analysis.

Preprocessing 24 suitable for most approaches and 
methods will include elimination of sequence irrelevant to, 
or that would interfere with, the subsequent analysis. 
Such sequence includes repetitive sequence, such as Alu 
repeats and LINE elements, vector sequence, artificial
sequence, such as artificial polylinkers, and the like. 
Such removal can readily be performed by identification and 
subsequent masking of the undesired sequence. 

Identification can be effected by comparing the 
genomic sequence returned by query 20 with public or
private databases containing known repetitive sequence,
vector sequence, artificial sequence, and other artifactual
sequence. Such comparison can readily be done using
programs well known in the art, such as CROSS_MATCH, or by
proprietary sequence comparison programs the engineering of
which is well within the skill in the art. 

Alternatively, or in addition, undesirable, 
including artifactual, sequence can be identified
algorithmically without comparison to external databases
and thereafter removed. For example, synthetic polylinker
sequence can be identified by an algorithm that identifies
a significantly higher than average density of known
restriction sites. As another example, vector sequence can
be identified by algorithms that identify nucleotide or
codon usage at variance with that of the bulk of the
genomic sequence.
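
A minimal sketch of the restriction-site-density idea follows. The site list (nine common six-base recognition sequences), the window size, and the threshold are all assumptions chosen for illustration; the text does not specify them. The function reports whether any window of the sequence contains at least `min_sites` site starts.

```python
# Recognition sequences of EcoRI, BamHI, HindIII, PstI, SalI, XbaI,
# SmaI, KpnI and SacI (an illustrative, not exhaustive, list).
RESTRICTION_SITES = (
    "GAATTC", "GGATCC", "AAGCTT", "CTGCAG", "GTCGAC",
    "TCTAGA", "CCCGGG", "GGTACC", "GAGCTC",
)


def site_starts(seq, sites=RESTRICTION_SITES):
    """Return sorted start positions of every listed site in seq."""
    seq = seq.upper()
    starts = set()
    for site in sites:
        i = seq.find(site)
        while i != -1:
            starts.add(i)
            i = seq.find(site, i + 1)
    return sorted(starts)


def has_polylinker(seq, window=60, min_sites=4):
    """True if min_sites site starts fall within any window-bp stretch."""
    starts = site_starts(seq)
    for i in range(len(starts) - min_sites + 1):
        if starts[i + min_sites - 1] - starts[i] < window:
            return True
    return False
```

A flagged stretch would then be masked or excised in the same way as sequence identified by database comparison.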

Once identified, undesired sequence can be
removed. Removal can usefully be done by masking the
undesired sequence as, for example, by converting the
specific nucleotide references to one that is unrecognized
by the subsequent bioinformatic algorithms, such as "X".
Alternatively, but at present less preferred, the undesired
sequence can be excised from the returned genomic sequence,
leaving gaps.
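
Masking can be illustrated with a short Python sketch. The function name and the half-open, 0-based interval convention are assumptions. Masking keeps the sequence length, and therefore all downstream coordinates, unchanged, which is one practical reason to prefer it over excision.

```python
def mask_intervals(seq, intervals, mask_char="X"):
    """Replace each (start, end) interval (0-based, end-exclusive)
    with mask_char, leaving the sequence length unchanged."""
    chars = list(seq)
    for start, end in intervals:
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = mask_char
    return "".join(chars)
```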

Preprocessing 24 can include selection
from among duplicative sequences of that one sequence of
highest quality. Higher quality can be measured as a lower
percentage of, fewest number of, or least densely clustered
occurrence of ambiguous nucleotides, defined as those
nucleotides that are identified in the genomic sequence
using symbols indicating ambiguity. Higher quality can
also or alternatively be valued by presence in the longest
contig.

Preprocessing 24 can, and often will, also
include formatting of the data as specifically appropriate
for passage to the analytical algorithms of process 25.
Such formatting can and typically will include, inter alia,
addition of a unique sequence identifier, either derived
from the original accession number in genomic sequence
database 100, or newly applied, and can further include
additional annotation. Formatting can include conversion
from one to another sequence listing standard, such as
conversion to or from FASTA or the like, depending upon the
input expected by the subsequent process.
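
The quality selection and formatting steps might be sketched as below. This is an assumption-laden illustration: quality is scored here only by the fraction of IUPAC ambiguity symbols, with longer sequence as a tie-breaker, and all function names are hypothetical.

```python
# IUPAC symbols that denote an ambiguous nucleotide.
AMBIGUOUS = set("NRYKMSWBDHV")


def ambiguous_fraction(seq):
    """Fraction of positions carrying an ambiguity symbol."""
    if not seq:
        return 1.0
    return sum(1 for base in seq.upper() if base in AMBIGUOUS) / len(seq)


def best_duplicate(candidates):
    """Pick the identifier of the highest-quality duplicate.

    `candidates` maps identifier -> sequence. Lower ambiguous fraction
    wins; among equals, the longer sequence wins.
    """
    return min(
        candidates,
        key=lambda k: (ambiguous_fraction(candidates[k]), -len(candidates[k])),
    )


def to_fasta(seq_id, seq, width=60):
    """Format one sequence as a FASTA record with a unique identifier."""
    lines = [f">{seq_id}"]
    lines += [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join(lines) + "\n"
```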
Preprocessing, which can be optional depending
upon the function desired to be identified and the
informational requirements of the methods for effecting
such identification, is followed by sequence processing 25,
where sequences with the desired function are identified
within the genomic sequence.

As mentioned above, such functions can include,
but are not limited to, encoding protein, regulating
transcription, regulating message transport after
transcription into mRNA, regulating message splicing after
transcription, or regulating message degradation, and the
like. Other functions include directing somatic
recombination events, contributing to chromosomal stability
or movement, contributing to allelic exclusion or X
chromosome inactivation, or the like.

The methods of the present invention are
particularly useful for gene discovery, that is, for
identifying, from genomic sequence, regions that function
to encode genes, and in a particularly useful embodiment,
for identifying regions that function to encode genes not
hitherto identified by expression-based or directed cloning
and sequencing. In conjunction with verification using the 
novel single exon microarrays of the present invention, as 
further described below, the methods herein described 
become powerful gene discovery tools. 

Accordingly, in a preferred embodiment of the
present invention, process 25 is used to identify putative
coding regions. Two preferred approaches in process 25 for 
identifying sequence that encodes putative genes are gene 
prediction and comparative sequence analysis. 

Gene prediction can be performed using any of a
number of algorithmic methods, embodied in one or more
software programs, that identify open reading frames (ORFs)
using a variety of heuristics, such as GRAIL, DICTION, and
GENEFINDER. Comparative sequence analysis similarly can be
performed using any of a variety of known programs that
identify regions with lower sequence variability. 

As further described in Example 1, below, gene 
finding software programs yield a range of results. For 
the newly accessioned human genomic sequence input in
Example 1, for example, GRAIL identified the greatest
percentage of genomic sequence as putative coding region,
2% of the data analyzed; GENEFINDER was second, calling 1%;
and DICTION yielded the least putative coding region, with
0.8% of genomic sequence called as coding region.

Increased reliability can be obtained when
consensus is required among several such methods. Although
discussed herein particularly with respect to exon calling,
consensus among methods will in general increase
reliability of predicting other functions as well.

Thus, as indicated by query 26, sequence
processing 25, optionally with preprocessing 24, can be
repeated with a different method, with consensus among such 
iterations determined and reported in process 27. 

Process 27 compares the several outputs for a 
given input genomic sequence and identifies consensus among
the separately reported results. The consensus itself, as 
well as the sequence meeting that consensus, is then stored 
in process 29a, displayed in process 29b, and/or output to 
process 300 for subsequent identification of a subset 
thereof suitable for assay.

Multiple levels of consensus can be calculated 
and reported by process 27. For example, as further 
described in Example 1, infra, process 27 can report 
consensus as between all specific pairs of methods of gene 
prediction, as consensus among any one or more of the pairs
of methods of gene prediction, or as among all of the gene 
prediction algorithms used. Thus, in Example 1, process 27 
reported that GRAIL and GENEFINDER programs agreed on 0.7% 
of genomic sequence, that GRAIL and DICTION agreed on 0.5%
of genomic sequence, and that the three programs together
agreed on 0.25% of the data analyzed. Put another way, 
0.25% of the genomic sequence was identified by all three 
of the programs as containing putative coding region. 
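
Consensus of this kind can be computed by intersecting the bases each method calls as coding. The sketch below is a hypothetical illustration of that calculation, not the implementation of process 27: it assumes each method's calls are given as 0-based, end-exclusive intervals on one sequence, and the method names used as keys are placeholders.

```python
from itertools import combinations


def called_bases(intervals):
    """Set of base positions covered by (start, end) intervals."""
    bases = set()
    for start, end in intervals:
        bases.update(range(start, end))
    return bases


def consensus_fractions(calls, sequence_length):
    """Fraction of the sequence agreed on by each pair of methods
    and by all methods together.

    `calls` maps method name -> list of (start, end) coding calls.
    """
    covered = {name: called_bases(iv) for name, iv in calls.items()}
    report = {}
    for a, b in combinations(sorted(covered), 2):
        report[(a, b)] = len(covered[a] & covered[b]) / sequence_length
    report["all"] = len(set.intersection(*covered.values())) / sequence_length
    return report
```

Representing calls as explicit position sets is simple but memory-hungry; for whole contigs an interval-merging approach would scale better.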

Furthermore, consensus can be required among 
different approaches to identifying a chosen function.

For example, if the function desired to be
identified is coding of protein sequence, and a first used
approach to exon calling is gene prediction, the process
can be repeated on the same input sequence, or subset
thereof, with another approach, such as comparative
sequence analysis. In such a case, where comparative
sequence analysis follows gene prediction, the comparison
can be performed not only on genomic nucleic acid sequence,
but additionally or alternatively can be performed on the
predicted amino acid sequence translated from the ORFs
prior identified by the gene prediction approach.

Although shown as an iterative process, the 
multiple analyses required to achieve consensus can be done 
in series, in parallel, or some combination thereof. 

Predicted functional sequence, optionally
representing a consensus among a plurality of methods and
approaches for determination thereof, is passed to process
300 for identification of a subset thereof for functional
assay.

In the preferred embodiment of the methods of the
present invention, wherein the function sought to be 
identified is protein coding, process 300 is used to 
identify a subset thereof suitable for experimental 
verification by physical and/or bioinformatic approaches. 

For example, putative ORFs identified in process
200 can be classified, or binned, bioinformatically into
putative genes. This binning can be based inter alia upon
consideration of the average number of exons/gene in the
species chosen for analysis, upon density of exons that
have been called on the genomic sequence, and other
empirical rules. Thereafter, one or more among the
gene-specific ORFs can be chosen for subsequent use in gene
expression assay.

Where such subsequent gene expression assay uses
amplified nucleic acid, considerations such as desired
amplicon length, primer synthesis requirements, putative
exon length, sequence GC content, existence of possible
secondary structure, and the like can be used to identify
and select those ORFs that appear most likely successfully
to amplify. Where subsequent gene expression assay relies
upon nucleic acid hybridization, whether or not using
amplified product, further considerations involving
hybridization stringency can be applied to identify that
subset of sequences that will most readily permit
sequence-specific discrimination at a chosen hybridization and wash
stringency. One particular such consideration is avoidance 
of putative exons that span repetitive sequence; such 
sequence can hybridize spuriously to nonspecific message, 
reducing specific signal in the hybridization. 

For bioinformatic assay, there are fewer
constraints on the sequences that can be tested
experimentally, and in this latter case therefore process 
300 can output the entirety of the input sequence. 

The subset of sequences identified by process 300 
as suitable for use in assay is then used in process 400 to
create the physical and/or informational substrate for 
experimental verification of the predictions made in 
process 200, and thereafter to assay those substrates. 

As mentioned, the methods of the present 
invention are particularly useful for identifying potential 
coding regions within genomic sequence. In a preferred
embodiment of process 400, therefore, the expression of the 
sequences predicted to encode protein is verified. The 
combination of the predictive and experimental methods 
provides a powerful gene discovery engine.

Thus, in another aspect, the present invention 
provides methods and apparatus for verifying the expression 
of putative genes identified within genomic sequence. In 
particular, the invention provides a novel method of 
verifying gene expression in which expression of predicted
ORFs is measured and confirmed using a novel type of
nucleic acid microarray, the genome-derived single exon
nucleic acid microarrays of the present invention. 

Putative ORFs as predicted by a consensus of gene 
calling, particularly gene prediction, algorithms in
process 200, and as further identified as suitable by
process 300, are amplified from genomic DNA using the 
polymerase chain reaction (PCR). Although PCR is 
conveniently used, other amplification approaches can also 
be used.

Amplification schemes can be designed to capture 
the entirety of each predicted ORF in an amplicon with 
minimal additional (that is, intronic or intergenic) 
sequence. Because ORFs predicted from human genomic 
sequence using the methods of the present invention differ
in length, such an approach results in amplicons of varying
length. 

However, most predicted ORFs are shorter than 500 
bp in length, and although amplicons of at least about 100
or 200 base pairs can be immobilized as probes on nucleic
acid microarrays, early experimental results using the
methods of the present invention have suggested that longer
amplicons, at least about 400 or 500 base pairs, are more
effective. Furthermore, certain advantages derive from
application to the microarray of amplicons of defined size.

Therefore, amplification schemes can
alternatively, and preferably, be designed to amplify
regions of defined size, preferably at least about 300, 400
or 500 bp, centered about each predicted ORF. Such an
approach results in a population of amplicons of limited
size diversity, but that typically contain intronic and/or 
intergenic nucleic acid in addition to putative ORF. 
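
Choosing a fixed-size window centered on a predicted ORF is simple arithmetic, sketched below. The function name, the 0-based end-exclusive coordinates, and the clamping of the window to the contig boundaries are assumptions made for illustration. An ORF longer than the target size is given a window equal to its own length in this sketch.

```python
def amplicon_window(orf_start, orf_end, contig_length, target_size=500):
    """Return (start, end) of a window of about target_size bp centered
    on the ORF and clamped to lie within the contig."""
    size = max(target_size, orf_end - orf_start)
    center = (orf_start + orf_end) // 2
    start = center - size // 2
    # Shift the window back inside the contig if it overhangs either end.
    start = max(0, min(start, contig_length - size))
    end = min(contig_length, start + size)
    return start, end
```

For an ORF at positions 1000 to 1200 on a 10,000 bp contig, this yields the window 850 to 1350, which adds 150 bp of flanking sequence on each side.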

Conversely, somewhat fewer than 10% of ORFs 
predicted from human genomic sequence according to the
methods of the present invention exceed 500 bp in length.
Portions of such extended ORFs, preferably at least about
300, 400 or 500 bp in length, can be amplified. However, it
has been discovered that the percentage success at
amplifying pieces of such ORFs is low, and that such
putative exons are more effectively amplified when larger
regions, as large as 2000 bp, are amplified.


The putative ORFs selected in process 300 are
thus input into one or more primer design programs, such as
PRIMER3 (available online for [...]), with a goal
of amplifying [...] genomic sequence for ORFs predicted to
be no more than about 500 bp, or at least about 1000 - 1500
bp of genomic sequence for ORFs predicted to exceed 500 bp
in length, and the primers synthesized. Primers with the
requisite sequences can be purchased commercially or
synthesized by standard techniques.

Conveniently, a first predetermined sequence can
be added commonly to the ORF-specific 5' primer and a
second, typically different, predetermined sequence
commonly added to each ORF-unique 3' primer. This serves
to immortalize the amplicon, that is, serves to permit
further amplification of any amplicon using a single set of
primers complementary respectively to the common 5' and
common 3' sequence elements. The presence of these
"universal" priming sequences further facilitates later
sequence verification, providing a sequence common to all
amplicons at which to prime sequencing reactions. The
common 5' and 3' sequences further serve to add a cloning
site should any of the ORFs warrant further study.
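
The effect of the common tails can be shown with a small sketch. All sequences are written 5' to 3', the function names are hypothetical, and any tail sequences passed in are placeholders rather than the sequences used in the examples.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def add_universal_tails(forward_primer, reverse_primer, tail_5, tail_3):
    """Prepend the common tails to the 5' ends of the ORF-specific primers."""
    return tail_5 + forward_primer, tail_3 + reverse_primer


def tailed_amplicon(target, tail_5, tail_3):
    """Top strand of the product made with tailed primers: every
    amplicon starts with tail_5 and ends with the reverse complement
    of tail_3, so one primer pair can reamplify or sequence them all."""
    return tail_5 + target + reverse_complement(tail_3)
```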

Such predetermined sequence is usefully at least
about 10, 12 or 15 nt in length, and usually does not
exceed about 25 nt in length. The "universal" priming
sequences used in the examples presented infra were each 16
nt long.

The genomic DNA to be used as substrate for
amplification will come from the eukaryotic species from
which the genomic sequence data had originally been
obtained, or a closely related species, and can
[...] (2nd edition, December 1989) [...]. Many such
are available commercially, with the human genomic DNAs
additionally having certification of donor informed
consent.

Although the intronic and intergenic material
flanking putative coding regions in the amplicons could
potentially interfere with hybridizations during microarray
experiments, we have found, surprisingly, that differential
expression ratios are [...]. The predominant effect of exon
size is to alter the absolute signal intensity, rather than
its ratio. Equally surprising, the art had suggested that
single exon probes would not provide sufficient signal
intensity for high stringency hybridization analyses; we
find that such probes not only provide adequate signal, but
have substantial [...]

[...] spin column, with or without confirmation as to amplicon
quality as by gel electrophoresis, each amplicon (single
exon probe) is disposed in an array upon a support
substrate.

Methods for deposition
and fixation of nucleic acids onto support substrates are
well known in the art (reviewed by Schena et al.,
above).

Typically, the support substrate will be glass,
although other materials, such as amorphous or crystalline
silicon or plastics, can also be used. Such plastics include
polymethylacrylic, polyethylene, polypropylene,
polyacrylate, polymethylmethacrylate, polyvinylchloride,
polytetrafluoroethylene, polystyrene, polycarbonate,
polyacetal, polysulfone, celluloseacetate,
cellulosenitrate, nitrocellulose, and mixtures thereof.
Typically, the support will be rectangular,
although other shapes, particularly circular disks and even
spheres, present certain advantages. Particularly
advantageous alternatives to glass slides as support
substrates for array of nucleic acids are optical discs, as
described in WO 98/12559.

The amplified nucleic acids can be attached 
covalently to a surface of the support substrate or, more 
typically, applied to a derivatized surface in a chaotropic
agent that facilitates denaturation and adherence by 
presumed noncovalent interactions, or some combination 
thereof.

Robotic spotting devices useful for arraying 
nucleic acids on support substrates can be constructed
using public domain specifications (The MGuide, version
2.0, http://cmgm.stanford.edu/pbrown/mguide/index.html), or
can conveniently be purchased from commercial sources
(MicroArray GenII Spotter and MicroArray GenIII Spotter,
Molecular Dynamics, Inc., Sunnyvale, CA). Spotting can
also be effected by printing methods, including those using
ink jet technology.

As is well known in the art, microarrays 
typically also contain immobilized control nucleic acids.
For controls useful in providing measurements of background
signal for the genome-derived single exon microarrays of
the present invention, a plurality of E. coli genes can
readily be used. As further described in Example 1, 16 or
32 E. coli genes suffice to provide a robust measure of
background noise in such microarrays.
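
One common way to use such control spots is to set a detection threshold from their signal distribution. The sketch below is an assumption, not the procedure of Example 1: it takes the threshold as the control mean plus `k` sample standard deviations, and the function names and the default `k=2.0` are hypothetical.

```python
from statistics import mean, stdev


def background_threshold(control_intensities, k=2.0):
    """Mean of the control spots plus k sample standard deviations."""
    return mean(control_intensities) + k * stdev(control_intensities)


def above_background(probe_intensities, control_intensities, k=2.0):
    """Map each probe name to True if its signal exceeds the threshold."""
    threshold = background_threshold(control_intensities, k)
    return {name: value > threshold
            for name, value in probe_intensities.items()}
```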
As is well known in the art, the amplified
product disposed in arrays on a support substrate to create 
a nucleic acid microarray can consist entirely of natural 
nucleotides linked by phosphodiester bonds, or 
alternatively can include either nonnative nucleotides,
alternative internucleotide linkages, or both, so long as 
complementary binding can be obtained in the hybridization. 
If enzymatic amplification is used to produce the 
immobilized probes, the amplifying enzyme will impose 
certain further constraints upon the types of nucleic acid
analogs that can be generated. 

Although particularly described herein as using 
high density microarrays constructed on planar substrates, 
the methods of the present invention for confirming the 
expression of ORFs predicted from genomic sequence can use
any of the known types of microarrays, as herein defined, 
including lower density planar arrays, and microarrays on 
nonplanar, nonunitary, distributed substrates. 

For example, gene expression can be confirmed 
using hybridization to lower density arrays, such as those
constructed on membranes, such as nitrocellulose, nylon, 
and positively-charged derivatized nylon membranes. 
Further, gene expression can also be confirmed using 
nonplanar, bead-based microarrays such as are described in 
Brenner et al., Proc. Natl. Acad. Sci. USA 97(4):1665-1670
(2000); U.S. Patent No. 6,057,107; and U.S. Patent No. 
5,736,330. In theory, a packed collection of such beads 
provides in aggregate a higher density of nucleic acid 
probe than can be achieved with spotting or lithography 
techniques on a single planar substrate.

Planar microarrays on solid substrates, however, 
provide certain useful advantages, including high 
throughput and compatibility with existing readers. For 
example, each standard microscope slide can include at 
least 1000, typically at least 2000, preferably 5000 and
up to 10,000 - 50,000 or more nucleic acid probes of
discrete sequence. The number of sequences deposited will 
depend on their required application. 

Each putative gene can be represented in the 
array by a single predicted ORF. Alternatively, genes can
be represented by more than one predicted ORF. For 
purposes of measuring differential splicing, more than one 
predicted ORF will be provided for a putative gene. And as 
is well known in the art, each probe of defined sequence,
representing a single predicted ORF, can be deposited in a
plurality of locations on a single microarray to provide 
redundancy of signal. 

The genome-derived single exon microarrays 
described above differ in several fundamental and 
advantageous ways from microarrays presently used in the
gene expression art, including (1) those created by 
deposition of mRNA-derived nucleic acids, (2) those created 
by in situ synthesis of oligonucleotide probes, and (3) 
those constructed from yeast genomic DNA. 

Most nucleic acid microarrays that are in use for
study of eukaryotic gene expression have as immobilized
probes nucleic acids that are derived - either directly or
indirectly - from expressed message. As discussed above,
it is common, for example, for such microarrays to be
derived from cDNA/EST libraries, either from those
previously described in the literature, see Lennon et al.,
or from the de novo construction of "problem specific"
libraries targeted at a particular biological question,
R.S. Thomas et al., Cancer Res. (in press). Such
microarrays are herein collectively denominated "EST
microarrays".

Such EST microarrays by definition can measure 
expression only of those genes found in EST libraries, 
shown herein to represent only a fraction of expressed 
genes. Furthermore, such libraries - and thus, microarrays
based thereupon - are biased by the tissue or cell type of
message origin, by the expression levels of the respective
genes within the tissues, and by the ability of the message
[...].

Thus, as further discussed in Example 1, the
methods of the present invention enable sequences that do
not appear in EST or other expression databases to be
determined and subsequently arrayed for expression
measurements; such sequences could not, therefore, have been
probes on an EST microarray. And as further demonstrated
in the examples, infra, the remaining population of genes
identified from genomic sequence by the methods of the
present invention - that is, the one third of sequences
that had previously been accessioned in EST or other
expression databases - are biased toward genes with higher
expression levels.

Representation of a message in an EST and/or cDNA
library depends upon the successful reverse transcription,
optionally but typically followed by successful cloning, of
the message, which introduces bias into the population of
probes available for arraying in EST microarrays.

In contrast, neither reverse transcription nor
cloning is required to produce the probes arrayed on the
genome-derived single exon microarrays of the present
invention, and although the ultimate deposition of a probe
on the genome-derived single exon microarray of the present
invention depends upon a successful amplification from
genomic material, a priori knowledge of the sequence of the
desired amplicon affords greater opportunity to recover any
given probe sequence recalcitrant to amplification than is
afforded by the requirement for successful reverse
transcription and cloning of unknown message in EST
approaches.

Thus, the genome-derived single exon microarrays
of the present invention present a far greater diversity of
probes for measuring gene expression, with far less bias,
than do EST microarrays presently used in the art.

As a further consequence of their ultimate origin
from expressed message, the probes in EST microarrays often
contain poly-A (or complementary poly-T) stretches derived
from the poly-A tail of mature mRNA. These
stretches contribute to cross-hybridization, that is, to
spurious signal occasioned by hybridization to the
homopolymeric tail of a labeled cDNA that lacks sequence
homology to the gene-specific portion of the probe.

In contrast, the probes arrayed in the genome-
derived single exon microarrays of the present invention
lack homopolymeric stretches derived from message
polyadenylation, and thus can provide more specific signal.
Typically, at least about 50, 60 or 75% of the probes on
the genome-derived single exon microarrays of the present
invention lack homopolymeric regions consisting of A or T,
where a homopolymeric region is defined for purposes herein
as stretches of 25 or more, typically 30 or more, identical
nucleotides.
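
The homopolymer criterion just stated lends itself to a
simple computational check. The following is a minimal
Python sketch, offered only as an illustration: the function
names and the representation of probes as plain strings are
assumptions made for this example and are not part of the
described methods. It counts a probe as "lacking" a
homopolymeric region when it has no run of 25 or more
consecutive A or T residues.

```python
import re

# Threshold taken from the text: a homopolymeric region is a run of
# 25 or more identical nucleotides (30 is the stricter "typical" value).
MIN_RUN = 25


def has_poly_a_or_t(seq: str, min_run: int = MIN_RUN) -> bool:
    """Return True if seq contains a run of at least min_run A's or T's."""
    pattern = re.compile(r"A{%d,}|T{%d,}" % (min_run, min_run), re.IGNORECASE)
    return pattern.search(seq) is not None


def fraction_lacking_homopolymer(probes, min_run: int = MIN_RUN) -> float:
    """Fraction of probe sequences with no poly-A/poly-T run of min_run or more."""
    probes = list(probes)
    if not probes:
        raise ValueError("no probes supplied")
    clean = sum(1 for p in probes if not has_poly_a_or_t(p, min_run))
    return clean / len(probes)
```

Applied to the full set of probe sequences for an array, the
returned fraction can be compared directly against the 50,
60 or 75% figures given above.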

A further distinction, which also affects the
specificity of hybridization, is occasioned by the typical
derivation of EST microarray probes from cloned material.
Because much of the probe material disposed as probes on
EST microarrays is excised or amplified from plasmid,
phage, or phagemid vectors, EST microarrays
include a fair amount of vector sequence, more so when the
probes are amplified, rather than excised, from the vector.

In contrast, the vast majority of probes in the
genome-derived single exon microarrays of the present
invention contain no prokaryotic or bacteriophage vector
sequence, having been amplified directly or indirectly from
genomic DNA. Typically, therefore, at least about 50, 60,
70 or 80% or more of individual exon-including probes
on the genome-derived single exon microarray of the
present invention lack sequence derived from plasmids and
bacteriophage. Preferably, [...] of the probes on the
genome-derived single exon microarray of the present
invention [...] through preprocessing 24 [...]. The
substantial absence of vector sequence from the
genome-derived single exon microarrays of the present
invention results in greater specificity during
hybridization, since spurious cross-hybridization to a
probe vector sequence is reduced.

As a further consequence of excision or
amplification of probes from vectors in construction of EST
microarrays, the probes thereon often include artificial
sequence, derived from vector multiple cloning sites, at
both ends. The probes disposed upon the genome-derived
single exon microarrays need have no such artificial
sequence.

As mentioned above, however, the ORF-specific
primers used to amplify putative ORFs can include
artificial sequences, typically 5' to the ORF-specific
primer sequence, useful for "universal" (that is,
independent of ORF sequence) priming of subsequent
amplification or sequencing reactions. When such
artificial sequences are included in the amplification
primers, the probes disposed upon the genome-derived single
exon microarray will include artificial sequence similar to
that found in EST microarrays. However, the genome-derived
single exon microarray of the present invention can be made
without such sequences, and if so constructed, presents an
even smaller amount of nonspecific sequence that would
contribute to nonspecific hybridization.

Yet another consequence of typical use of cloned
material as probes in EST microarrays is that such
microarrays contain probes that result from cloning
artifacts, such as [...]. In contrast, the probes of
the genome-derived single exon microarrays of the present
invention lack such cloning artifacts, and thus provide
greater specificity of signal in gene expression
measurements.

A further consequence of the cloned origin of
probes on many EST microarrays is that the individual
probes often have disparate sizes, which can cause the
optimal hybridization stringency to vary among probes on a
single microarray. In contrast, as discussed above,
probes arrayed on the genome-derived single exon
microarrays of the present invention can readily be
provided with a narrow distribution in sizes, with a
range of probe sizes no greater than [...] of the
average size, typically no greater than [...] of the
average probe size.

As a consequence of origin from fully- or partially-
spliced message, probes disposed upon EST arrays will often
include multiple exons. The percentage of such
exon-spanning probes in an EST microarray can be
calculated, on average, based upon the predicted number of
exons/gene for the given species and the average length of
the immobilized probes. For human genes, the near-complete
sequence of human chromosome 22, Dunham et al., Nature
402(6761):489-95 (1999), predicts that human genes average
5.5 exons/gene. Even with probes of 200 - 500 bp, the vast
majority of human EST microarray probes include more than
one exon.

In contrast, by virtue of their origin from
algorithmically identified ORFs in genomic sequence, the
probes in the genome-derived single exon microarrays of the
present invention can consist of individual exons. Thus,
[...] of the exon-including probes on the genome-derived
microarray of the present invention consist of, or include,
no more than one exon.

It is thus possible, as cannot readily be achieved
using EST microarrays, to use the genome-derived single
exon microarrays of the present invention to measure
the expression of individual exons, which in turn permits
differential splicing to be related to tissue-specific
expression patterns.

Furthermore, the exons that are represented in
EST microarrays are often biased toward the 3' or 5' end of
their respective genes, since [...]. In contrast, no such
bias need attend the selection of exons
for disposition on the genome-derived single exon
microarrays of the present invention.

Conversely, the probes provided on the genome-
derived single exon microarrays of the present invention
typically, but need not necessarily, include intronic
and/or intergenic sequence that is absent from EST
microarrays, which are derived from mature mRNA.
Typically, at least about 50, 60, 70, 80 or 90% of the
exon-including probes on the genome-derived single exon
microarrays of the present invention include sequence drawn
from noncoding regions. As discussed above, the additional
presence of noncoding sequence does not
interfere with measurement of gene expression, and provides
the additional opportunity to assay prespliced RNA, and
thus measure such phenomena as nuclear export control.

The genome-derived single exon microarrays of the
present invention are also quite different from in situ
synthesis microarrays, where probe size is severely
constrained by inadequacies in the photolithographic
synthesis process.

Typically, probes arrayed on in situ synthesis
microarrays are limited to a maximum of about 25 bp. As a
well known consequence, hybridization to such chips must be
performed at low stringency. In order, therefore, to
achieve unambiguous sequence-specific hybridization
results, the in situ synthesis microarray requires
substantial redundancy, with concomitant programmed
arraying for each probe of probe analogues with altered
(i.e., mismatched) sequence.

In contrast, the longer probe length of the
genome-derived single exon microarrays of the present
invention allows much higher stringency hybridization and
wash. Typically, therefore, exon-including probes on the
genome-derived single exon microarrays of the present
invention average at least about 100, 200, 300, 400 or
500 bp in length. By obviating the need for substantial
probe redundancy, this approach permits a higher density of
probes for discrete exons or genes to be arrayed on the
microarrays of the present invention than can be achieved
for in situ synthesis microarrays.

A further distinction is that the probes in in
situ synthesis microarrays typically are covalently linked
to the substrate surface. In contrast, the probes disposed
on the genome-derived microarray of the present invention
typically are, but need not necessarily be, bound
noncovalently to the substrate.

Furthermore, the short probe size on in situ
microarrays causes large percentage differences in the
melting temperature of probes hybridized to their
complementary target sequence, and thus causes large
percentage differences in the theoretically optimum
stringency across the array as a whole.

In contrast, the larger probe size in the
microarrays of the present invention creates lower
percentage differences in melting temperature across the
range of arrayed probes.

A further significant advantage of the
microarrays of the present invention over in situ
synthesized arrays is that the quality of each individual
probe can be confirmed before deposition. In contrast, the
quality of probes cannot be assessed on a probe-by-probe
basis for the in situ synthesized microarrays presently
being used.

The genome-derived single exon microarrays of the
present invention are also distinguished over, and present
substantial benefits over, the genome-derived microarrays
from lower eukaryotes such as yeast. Lashkari et al.,
Proc. Natl. Acad. Sci. USA 94:13057-13062 (1997).

Only about 220 - 250 of the 6100 or so nuclear
genes in Saccharomyces cerevisiae - that is, only about 4
- 5% - have standard, spliceosomal, introns, Lopez et al.,
Nucl. Acids Res. 28:85-86 (2000); Spingola et al., RNA
5(2):221-34 (1999). Furthermore, the entire yeast genome
has already been sequenced. These two facts permit the
ready amplification and disposition of single-ORF amplicons
on such microarray without the requirement for antecedent
use of gene prediction and/or comparative sequence
analyses.

Thus, a significant aspect of the present
invention is the ability to identify and to confirm
expression of predicted coding regions in genomic sequence
drawn from eukaryotic organisms that have a higher
percentage of genes having introns than do yeast such as
Saccharomyces cerevisiae, particularly in genomic sequence
drawn from eukaryotes in which at least about 10, 20 or 50%
of protein-encoding genes have introns. In preferred
embodiments, the methods and apparatus of the present
invention are applied to genomic sequence drawn from
eukaryotes in which the average number of introns per gene
is at least one, two or three or more.

After the physical substrate is prepared,
experimental verification of predicted function is
performed.

In [...] of the present invention, where the
function sought to be identified in genomic sequence is
protein coding, experimental verification is performed by
measuring expression, through nucleic acid hybridization in
preferred embodiments, and in particularly preferred
embodiments, through hybridization to genome-derived single
exon microarrays [...].

[...] measured and expressed [...] described in
[...] below. The mRNA source for the reference against
which specific expression is measured can be derived from a
homogeneous source, such as a single [...] cell types, as
further described in Example 2, infra. [...] can be
prepared by standard techniques, see Ausubel et al. and
Maniatis et al., or purchased commercially. The mRNA is
then typically reverse-transcribed in the presence of a
first label, [...] and mRNA from the [...] source (that in
which expression is desired to be measured) is reverse
transcribed in the presence of a second label, typically a
fluorophore, typically fluorometrically-distinguishable
from the first label. As further described in Example 2,
infra, Cy3 and Cy5 dyes are particularly useful in these
methods. After partial purification, [...] hybridization
to the probe array using standard techniques, typically
under a [...].

After wash, microarrays are conveniently scanned
using a commercial microarray scanning device, such as a
[...] Scanner (Molecular Dynamics, Sunnyvale, CA). Data on
[...] are then passed, with or without interim storage,
to process 500, where the results for each probe are
related to the original sequence.

Often, hybridization of target material to the
genome-derived single exon microarray will identify certain
of the probes thereon as of particular interest. It
is often desirable that the user be able readily to obtain
sufficient quantities of an individual probe, either for
subsequent arrayed deposition upon an additional support,
often as part of a microarray having a plurality of probes,
or as a solitary solid-phase or solution-phase probe, for
further use.

Thus, in another aspect, the present invention
provides compositions and kits for the ready production of
nucleic acids identical in sequence to, or substantially
identical in sequence to, probes on the genome-derived
single exon microarrays of the present invention.

In this aspect, a small quantity of each probe is
disposed, typically without attachment to substrate, in a
spatially-addressable ordered set, typically one per well
of a microtiter dish. Although a 96 well microtiter plate
can be used, greater efficiency is obtained using higher
density arrays, such as are provided by microtiter plates
having 384, 864, 1536, 3456, 6144, or 9600 wells, and
although microtiter plates having physical depressions
(wells) are conveniently used, any device that permits
addressable withdrawal of reagent from fluidly-
noncommunicating areas can be used.
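
The notion of a spatially-addressable ordered set can be
made concrete with a short Python sketch. It is offered
under stated assumptions rather than as the method of the
invention: the row-by-row fill order, the function names,
and the restriction to three common plate geometries (96,
384 and 1536 wells) are choices made here for illustration
only.

```python
from typing import Dict, Tuple

# Common plate geometries as (rows, columns). Only three of the plate
# formats mentioned in the text are included in this illustration.
PLATE_LAYOUTS: Dict[int, Tuple[int, int]] = {
    96: (8, 12),
    384: (16, 24),
    1536: (32, 48),
}


def row_label(row: int) -> str:
    """0 -> 'A', 25 -> 'Z', 26 -> 'AA' (spreadsheet-style row names)."""
    label = ""
    row += 1
    while row > 0:
        row, rem = divmod(row - 1, 26)
        label = chr(ord("A") + rem) + label
    return label


def probe_to_well(probe_index: int, wells: int = 384) -> Tuple[int, str]:
    """Map a 0-based probe index to (plate number, well name).

    Assumes plates are filled row by row, one probe per well.
    """
    rows, cols = PLATE_LAYOUTS[wells]
    plate, offset = divmod(probe_index, rows * cols)
    row, col = divmod(offset, cols)
    return plate + 1, f"{row_label(row)}{col + 1}"
```

With such a mapping, the address of any probe on the
microarray can be recorded alongside its sequence, so that
the corresponding well can later be located for withdrawal.
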

In this aspect of the invention, therefore, a
fluidly noncommunicating addressable ordered set of
individual probes, corresponding to those on a genome-
derived single exon microarray, is provided, with each
probe in sufficient quantity to permit amplification, such
as by PCR. As earlier mentioned, the ORF-specific
5' primers used for genomic amplification can have a first
common sequence added thereto, and the ORF-specific 3'
primers used for genomic amplification can have a second,
different, common sequence added thereto, thus permitting,
in this preferred embodiment, the use of a single set of
5' and 3' primers to amplify any one of the probes from the
amplifiable ordered set.
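
The reasoning behind the common primer sequences can be
sketched in Python. This is a toy model under explicit
assumptions: primers are matched exactly against a
top-strand template string, the function names are invented
for the example, and the tail sequences used in any call are
hypothetical placeholders rather than sequences disclosed
herein.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def first_round_product(template: str, orf_fwd: str, orf_rev: str,
                        tail_5: str, tail_3: str) -> str:
    """Top strand of the amplicon made with tailed ORF-specific primers.

    orf_fwd anneals as written on the top strand; orf_rev is written 5'->3'
    and therefore appears on the top strand as its reverse complement.
    """
    start = template.find(orf_fwd)
    if start < 0:
        raise ValueError("forward primer site not found")
    rev_site = revcomp(orf_rev)
    end = template.find(rev_site, start + len(orf_fwd))
    if end < 0:
        raise ValueError("reverse primer site not found")
    end += len(rev_site)
    # The common tails are carried into the product at its two ends.
    return tail_5 + template[start:end] + revcomp(tail_3)


def can_reamplify(product: str, tail_5: str, tail_3: str) -> bool:
    """True if a single pair of common primers would flank this product."""
    return product.startswith(tail_5) and product.endswith(revcomp(tail_3))
```

Because every first-round product ends in the same two
common sequences regardless of the ORF it carries, the same
primer pair suffices for any well of the ordered set.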

Each discrete amplifiable probe can also be
packaged with amplification primers, solutes, buffers,
etc., and can be provided in dry (e.g., lyophilized) form
or wet, in the latter case typically with addition of
agents that retard evaporation.

In another aspect of the present invention, a
genome-derived single-exon microarray is packaged together
with such an ordered set of amplifiable probes
corresponding to the probes, or one or more subsets of
probes, thereon. In alternative embodiments, the ordered
set of amplifiable probes is packaged separately from the
genome-derived single exon microarray.

In some embodiments, the microarray and/or
ordered probe set are further packaged with recordable
media that provide probe identification and addressing
information, and that can additionally contain annotation
information, such as gene expression data. Such recordable
media can be packaged with the microarray, with the ordered
probe set, or with both.

If the microarray is constructed on a substrate
that incorporates recordable media, such as is described in
international patent application no. WO 98/12559, then
separate packaging of the genome-derived single exon
microarray and the bioinformatic information is not
required.

The amount of amplifiable probe material should
be sufficient to permit at least one amplification
sufficient for subsequent hybridization assay.

Although the use of high density genome-derived
microarrays on solid planar substrates is presently a
preferred approach for the physical confirmation and
characterization of the expression of sequences predicted
to encode protein, other types of microarrays (as herein
defined) can also be used.

Furthermore, as earlier mentioned, experimental
verification of the function predicted from genomic
sequence in process 200 can be bioinformatic, rather than,
or additional to, physical verification.

For example, where the function desired to be
identified is protein coding, the predicted ORFs can be
compared bioinformatically to sequences known or suspected
of being expressed.

Thus, the sequences output from process 300 (or
process 200), can be used to query expression databases,
such as EST databases, SNP ("single nucleotide
polymorphism") databases, known cDNA and mRNA sequences,
SAGE ("serial analysis of gene expression") databases, and
more generalized sequence databases that allow query for
expressed sequences. Such query can be done by any
sequence query algorithm, such as BLAST ("basic local
alignment search tool"). The results of such query -
including information on identical sequences and
information on nonidentical sequences that have diffuse or
focal regions of sequence homology to the query sequence -
can then be passed directly to process 500, or used to
inform analyses subsequently undertaken in process 200,
process 300, or process 400.

Experimental data, whether obtained by physical
or bioinformatic assay in process 400, is passed to process
500 where it is usefully related to the sequence data
itself, a process colloquially termed "annotation". Such
annotation can be done using any technique that usefully
relates the functional information to the sequence, as, for
example, by incorporating the functional data into the
record itself, by linking records in a hierarchical or
relational database, by linking to external databases, or
by a combination thereof. Such database techniques are
well within the skill in the art.
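
One of the options named above, linking records in a
relational database, can be sketched as follows. This is a
minimal Python illustration using the standard-library
sqlite3 module; the table names, column names and the
accession string used in any call are hypothetical and are
not drawn from the described apparatus.

```python
import sqlite3


def build_annotation_db() -> sqlite3.Connection:
    """Create an in-memory database linking sequences to functional data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE sequence (
            seq_id    INTEGER PRIMARY KEY,
            accession TEXT NOT NULL
        );
        CREATE TABLE annotation (
            ann_id   INTEGER PRIMARY KEY,
            seq_id   INTEGER NOT NULL REFERENCES sequence(seq_id),
            start_nt INTEGER NOT NULL,
            end_nt   INTEGER NOT NULL,
            source   TEXT NOT NULL,  -- e.g. 'prediction', 'physical'
            detail   TEXT
        );
        """
    )
    return conn


def add_sequence(conn: sqlite3.Connection, accession: str) -> int:
    """Insert a sequence record and return its key."""
    cur = conn.execute("INSERT INTO sequence (accession) VALUES (?)", (accession,))
    return cur.lastrowid


def add_annotation(conn, seq_id, start_nt, end_nt, source, detail=None) -> None:
    """Attach one piece of functional data to a region of a sequence."""
    conn.execute(
        "INSERT INTO annotation (seq_id, start_nt, end_nt, source, detail) "
        "VALUES (?, ?, ?, ?, ?)",
        (seq_id, start_nt, end_nt, source, detail),
    )


def annotations_for(conn, accession):
    """Return (start, end, source, detail) rows for one accession, by position."""
    rows = conn.execute(
        "SELECT a.start_nt, a.end_nt, a.source, a.detail "
        "FROM annotation a JOIN sequence s ON s.seq_id = a.seq_id "
        "WHERE s.accession = ? ORDER BY a.start_nt, a.ann_id",
        (accession,),
    )
    return rows.fetchall()
```

Keeping the functional data in a separate, linked table lets
further assays be added for the same region without altering
the sequence record itself.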

The annotated sequence data can be stored 
locally, uploaded to genomic sequence database 100, and/or 
displayed 800. 

The methods and apparatus of the present
invention rapidly produce functional information from
genomic sequence. Coupled with the escalating pace at
which sequence now accumulates, the rapid pace of sequence
annotation produces a need for methods of displaying the
information in meaningful ways.

FIG. 3 shows visual display 80 presenting a
single genomic sequence annotated according to the present
invention. Because of its nominal resemblance to artistic
works of Piet Mondrian, visual display 80 is alternatively
described herein as a "Mondrian".

Each of the visual elements of display 80 is
aligned with respect to the genomic sequence being
annotated (hereinafter, the "annotated sequence"). Given
the number of nucleotides typically represented in an
annotated sequence, representation of individual
nucleotides would rarely be readable in hard copy output of
display 80. Typically, therefore, the annotated sequence
is schematized as rectangle 89, extending from the left
border of display 80 to its right border. By convention
herein, the left border of rectangle 89 represents the
first nucleotide of the sequence and the right border of
rectangle 89 represents the last nucleotide of the
sequence.

As further discussed below, however, the Mondrian
visual display of annotated sequence can serve as a
convenient graphical user interface for computerized
representation, analysis, and query of information stored
electronically. For such use, the individual nucleotides
can conveniently be linked to the X axis coordinate of
rectangle 89. This permits the annotated sequence at any
point within rectangle 89 readily to be viewed, either
automatically - for example, by time-delayed appearance of
a small overlaid window upon movement of a cursor or other
pointer over rectangle 89 - or through user intervention,
as by clicking a mouse or other pointing device at a point
in rectangle 89.
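
The linkage between nucleotides and the X axis coordinate of
rectangle 89 can be illustrated with a short Python sketch.
It assumes, for this example only, a linear scale in which
the first nucleotide sits at the left border and the last at
the right border; the function and parameter names are
hypothetical.

```python
def nt_to_x(nt: int, seq_len: int, x_left: float, x_right: float) -> float:
    """Map a 1-based nucleotide position to an X coordinate.

    Nucleotide 1 maps to the left border and nucleotide seq_len to the
    right border, following the convention stated in the text.
    """
    if seq_len < 2:
        raise ValueError("sequence must contain at least two nucleotides")
    if not 1 <= nt <= seq_len:
        raise ValueError("nucleotide position out of range")
    return x_left + (nt - 1) * (x_right - x_left) / (seq_len - 1)


def x_to_nt(x: float, seq_len: int, x_left: float, x_right: float) -> int:
    """Map a pointer X coordinate back to the nearest 1-based nucleotide.

    Coordinates outside the rectangle are clamped to its borders.
    """
    frac = (x - x_left) / (x_right - x_left)
    frac = min(max(frac, 0.0), 1.0)
    return 1 + round(frac * (seq_len - 1))
```

A pointer handler could call x_to_nt with the cursor
position and then display the sequence surrounding the
returned nucleotide in the overlaid window.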

Visual display 80 is generated after user
specification of the genomic sequence to be displayed.
Such specification can consist of or include an accession
number for a single clone (e.g., a single BAC accessioned
into GenBank), wherein the starting and stopping
nucleotides are thus absolutely identified, or
alternatively can consist of or include an anchor or
fulcrum point about which a chosen range of sequence is
anchored, thus providing relative endpoints for the
sequence to be displayed. For example, the user can anchor
such a range about a given chromosomal map location, gene
name, or even a sequence returned by query for similarity
or identity to an input query sequence. When visual
display 80 is used as a graphical user interface to
computerized data, additional control over the first and
last displayed nucleotide will typically be dynamically
[...], as by use of standard [...] tools.

Field 81 of visual display 80 is used to present
the output from process 200, that is, to present the
[...] initio prediction of [...]. Rectangles 83
indicate, by their X-axis coordinates, the regions of the
annotated sequence predicted to have function.

Where a bioinformatic method or approach
identifies a plurality of regions having the desired
function, a plurality of rectangles 83 is displayed.
Where multiple methods and/or approaches are used to
identify function, each is represented by its own
series of rectangles offset vertically from those
representing the results of the other methods and
approaches.

Thus, rectangles 83a in FIG. 3 represent the
functional predictions of a first method of a first
approach for predicting function, rectangles 83b represent
the functional predictions of a second method and/or second
approach for predicting that function, and rectangles 83c
represent the predictions of a third method and/or
approach.

Where the function desired to be identified is
protein coding, field 81 is used to present the
bioinformatic prediction of sequences encoding protein.
For example, rectangles 83a can represent the results from
GRAIL or GRAIL II, rectangles 83b can represent the results
from GENEFINDER, and rectangles 83c can represent
results from DICTION.

Optionally, and preferably, rectangles 83
collectively representing predictions of a single method
and/or approach are identically colored and/or textured,
and are distinguishable from the color and/or texture used
for a different method and/or approach.

Alternatively, or in addition, the color, hue,
density, or texture of rectangles 83 can be used further to
report a measure of the bioinformatic reliability of the
prediction. For example, many gene prediction programs
will report a measure of the reliability of prediction.
Thus, increasing degrees of such reliability can be
indicated, e.g., by increasing density of shading. Where
display 80 is used as a graphical user interface, such
measures of reliability, and indeed all other results
output by the program, can additionally or alternatively be
made accessible through linkage from individual rectangles
83, as by time-delayed window ("tool tip" window), or by
pointer (e.g., mouse)-activated link.

As earlier described, increased predictive
reliability can be achieved by requiring consensus among
methods and/or approaches to determining function. Thus,
field 81 can include a horizontal series of rectangles 83
that indicate one or more degrees of consensus in
predictions of function.
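
A consensus series of the kind just described could be
derived from the individual prediction series as in the
Python sketch below. The sketch is only one plausible way of
computing such a consensus and is not asserted to be the
approach used herein: it assumes that each method reports
inclusive, 1-based (start, end) intervals, that intervals
from any one method do not overlap one another, and that
consensus means coverage by at least a chosen number of
methods. The program names used as keys are labels only.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # inclusive, 1-based (start, end)


def consensus_regions(predictions: Dict[str, List[Interval]],
                      min_methods: int = 2) -> List[Interval]:
    """Return regions covered by at least min_methods prediction methods.

    Assumes the intervals reported by any single method do not overlap,
    so that coverage depth equals the number of agreeing methods.
    """
    events = []
    for intervals in predictions.values():
        for start, end in intervals:
            events.append((start, 1))     # coverage rises at start
            events.append((end + 1, -1))  # and falls just past end
    # At equal positions process rises before falls, so that abutting
    # intervals yield one continuous consensus region.
    events.sort(key=lambda e: (e[0], -e[1]))

    regions: List[Interval] = []
    depth = 0
    region_start = 0
    for pos, delta in events:
        previous = depth
        depth += delta
        if previous < min_methods <= depth:
            region_start = pos
        elif depth < min_methods <= previous:
            regions.append((region_start, pos - 1))
    return regions
```

Calling the function with increasing values of min_methods
gives the successive "degrees of consensus", each of which
can be drawn as its own horizontal series in field 81.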

Although FIG. 3 shows three series of
horizontally disposed rectangles in field 81, display 80
can include as few as one such series of rectangles and as
many as can discriminably be displayed, depending upon the
number of methods and/or approaches used to predict a given
function.

Furthermore, field 81 can be used to show
predictions of a plurality of different functions.
However, the increased visual complexity occasioned by such
display makes more useful the ability of the user to select
a single function for display. When display 80 is used as
a graphical user interface for computer query and analysis,
such function can usefully be indicated and user-
selectable, as by a series of graphical buttons or tabs
(not shown in FIG. 3).

Rectangle 89 is shown in FIG. 3 as including
interposed rectangle 84. Rectangle 84 represents the
portion of annotated sequence for which predicted
functional information has been assayed physically, with
the starting and ending nucleotides of the assayed material
indicated by the X axis coordinates of the left and right
borders of rectangle 84. Rectangle 85, with optional
inclusive circles 86 (86a, 86b, and 86c), displays the
results of such physical assay.

Although a single rectangle 84 is shown in FIG.
3, physical assay is not limited to just one region of
annotated genomic sequence. It is expected that an
increasing percentage of regions predicted to have function
by process 200 will be assayed physically, and that display
80 will accordingly, for any given genomic sequence, have
an increasing number of rectangles 84 and 85, representing
an increased density of sequence annotation.

Where the function desired to be identified is
protein coding, rectangle 84 identifies the sequence of the
probe used to measure expression. In embodiments of the
present invention where expression is measured using
genome-derived single exon microarrays, rectangle 84
identifies the sequence included within the probe
immobilized on the support surface of the microarray. As
noted above, such probe can include a small amount
of additional, synthetic, material incorporated during
amplification and designed to permit reamplification of the
probe, which sequence is typically not shown in display 80.

Rectangle 87 is used to present the results of
bioinformatic assay of the genomic sequence. For example,
where the function desired to be identified is protein
coding, process 400 can include bioinformatic query of
expression databases with the sequences predicted in
process 200 to encode exons. And as earlier discussed,
because bioinformatic assay presents fewer constraints than
does physical assay, often the entire output of process 200
can be used for such assay, without further subsetting
thereof by process 300. Therefore, rectangle 87 typically
need not have separate indicators therein of regions
submitted for bioinformatic assay; that is, rectangle 87
typically need not have regions therein analogous to
rectangles 84 within rectangle 89.

Rectangle 87 as shown in FIG. 3 includes smaller
rectangles 880 and 88. Rectangles 880 indicate regions
that returned a positive result in the bioinformatic assay,
with rectangles 88 representing regions that did not return
such positive results. Where the function desired to be
predicted and displayed is protein coding, rectangles 880
indicate regions of the predicted exons that identify
sequence with significant similarity in expression
databases, such as EST, SNP, SAGE databases, with
rectangles 88 indicating genes novel over those identified
in existing expression data bases.

Rectangles 880 can further indicate, through
color, shading, texture, or the like, additional
information obtained from bioinformatic assay.

For example, where the function assayed and
displayed is protein coding, the degree of shading of
rectangles 880 can be used to represent the degree of
sequence similarity found upon query of expression
databases. The number of levels of discrimination can be
as few as two (identity, and similarity, where similarity
has a user-selectable lower threshold). Alternatively, as
many different levels of discrimination can be indicated as
can visually be discriminated.

Where display 80 is used as a graphical user
interface, rectangles 880 can additionally provide links
directly to the sequences identified by the query of
expression databases, and/or statistical summaries thereof.
As with each of the precedingly-discussed uses of display
80 as a graphical user interface, it should be understood
that the information accessed via display 80 need not be
resident on the computer presenting such display, which
often will be serving as a client, with the linked
information resident on one or more remotely located
servers.
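
The levels of discrimination described above for rectangles
880 (identity, and similarity above a user-selectable lower
threshold) can be sketched in Python as follows. The use of
percent identity as the similarity measure, the default
threshold of 90, and the function names are assumptions made
for this illustration; none of them is specified in the
text.

```python
from typing import Optional


def classify_hit(percent_identity: Optional[float],
                 similarity_threshold: float = 90.0) -> str:
    """Two-level scheme: 'identity' or 'similarity'; otherwise 'novel'.

    percent_identity is None when the database query returned no hit.
    The threshold is the user-selectable lower bound for similarity.
    """
    if percent_identity is None or percent_identity < similarity_threshold:
        return "novel"
    if percent_identity >= 100.0:
        return "identity"
    return "similarity"


def shade_level(percent_identity: Optional[float],
                similarity_threshold: float = 90.0,
                n_levels: int = 4) -> int:
    """Bucket a hit into shading levels 1..n_levels (n_levels = identity).

    Returns 0 when there is no positive result, i.e. the region would be
    drawn as a rectangle 88 rather than a rectangle 880.
    """
    if percent_identity is None or percent_identity < similarity_threshold:
        return 0
    if percent_identity >= 100.0:
        return n_levels
    span = 100.0 - similarity_threshold
    frac = (percent_identity - similarity_threshold) / span
    return 1 + int(frac * (n_levels - 1))
```

Raising n_levels increases the number of shades, up to the
limit of what can visually be discriminated.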

Rectangle 85 displays the results of physical
assay of the sequence delimited by its left and right
borders.

Rectangle 85 can consist of a single rectangle,
thus indicating a single assay, or alternatively, and
increasingly typically, will consist of a series of
rectangles (85a, 85b, 85c) indicating separate physical
assays of the same sequence.

Where the function assayed is gene expression,
and where gene expression is assayed as herein described
using simultaneous two-color fluorescent detection of
hybridization to genome-derived single exon microarrays,
individual rectangles 85 can be colored to indicate the
degree of expression relative to control. Conveniently,
shades of green can be used to depict expression in the
sample over control values, and shades of red used to
depict expression less than control, corresponding to the
spectra of the Cy3 and Cy5 dyes conventionally used for
respective labeling thereof. Additional functional
information can be provided in the form of circles 86 (86a,
86b, 86c), where the diameter of the circle can be used to
indicate expression intensity. As discussed infra, such
relative expression (expression ratios) and absolute
expression (signal intensity) can be expressed using [...].

Where display 80 is used as a graphical user
interface, rectangle 85 can be used as a link to further
information about the assay. For example, where the assay
is one for gene expression, each rectangle 85 can be used
to link to information about the source of the hybridized
mRNA, the identity of the control, raw or processed data
from the microarray scan, or the like.
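
The coloring convention for rectangles 85 and the sizing of
circles 86 can be illustrated by the following Python
sketch. The green-over and red-under convention follows the
text; everything else is an assumption made for this
example, including the use of a log2 ratio, the saturation
point of a three-unit log2 difference, linear scaling of
circle diameter, and all function and parameter names.

```python
import math
from typing import Tuple


def ratio_to_rgb(sample: float, control: float,
                 max_log2: float = 3.0) -> Tuple[int, int, int]:
    """Color for a rectangle 85 from sample and control signals.

    Green shades depict expression over control, red shades depict
    expression below control, and black depicts no difference. The
    shade saturates at a log2 ratio of +/- max_log2.
    """
    if sample <= 0 or control <= 0:
        raise ValueError("signals must be positive")
    log_ratio = math.log2(sample / control)
    strength = min(abs(log_ratio) / max_log2, 1.0)
    level = round(255 * strength)
    if log_ratio > 0:
        return (0, level, 0)
    if log_ratio < 0:
        return (level, 0, 0)
    return (0, 0, 0)


def circle_diameter(intensity: float, max_intensity: float,
                    max_diameter: float = 20.0) -> float:
    """Diameter for a circle 86, scaled linearly with signal intensity."""
    if max_intensity <= 0:
        raise ValueError("max_intensity must be positive")
    frac = min(max(intensity / max_intensity, 0.0), 1.0)
    return max_diameter * frac
```

In this way a single glyph conveys both relative expression,
through the color of the rectangle, and absolute expression,
through the size of the circle.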

FIG. 4 is a rendition of display 80 representing
gene prediction and gene expression for a hypothetical BAC,
showing conventions used in the Examples presented infra.
BAC sequence ("Chip seq.") 89 is presented, with the
physically assayed region thereof (corresponding to
rectangle 84 in FIG. 3) shown in white. Algorithmic gene
predictions are shown in field 81, with predictions by
GRAIL shown, predictions by GENEFINDER, and predictions by
DICTION shown. Within rectangle 87, regions of sequence
that, when used to query expression databases, return
identical or similar sequences ("EST hit") are shown as
white rectangles (corresponding to rectangles 880 in FIG.
3), gray indicates low homology, and black indicates
unknowns (where black and gray would correspond to
rectangles 88 in FIG. 3).

Although FIGS. 3 and 4 show a single stretch of
sequence, uninterrupted from left to right, longer
sequences are usefully represented by vertical stacking of
such individual Mondrians, as shown in FIGS. 9 and 10.

Single Exon Probes Useful For Measuring Gene Expression 

The methods and apparatus of the present
invention rapidly produce functional information from
genomic sequence. Where the function to be identified is
protein coding, the methods and apparatus of the present
invention rapidly identify and confirm the expression of
portions of genomic sequence that function to encode
protein. As a direct result, the methods and apparatus of
the present invention rapidly yield large numbers of
single-exon nucleic acid probes, the majority from
previously unknown genes, each of which is useful for
measuring and/or surveying expression of a specific gene in
one or more tissues or cell types.

It is, therefore, another aspect of the present
invention to provide genome-derived single exon nucleic
acid probes useful for gene expression analysis, and
particularly for gene expression analysis by microarray.

Using the methods and genome-derived single-exon
microarrays of the present invention, we have for example
identified a large number of unique ORFs from human
genomic sequence. Using single exon probes that encompass
these ORFs, we have demonstrated, through microarray
hybridization analysis, the expression of [...] of these
ORFs in lung.

As would immediately be appreciated by one of
skill in the art, each single exon probe having
demonstrable expression in lung is currently available for
use in measuring the level of its ORF's expression in lung.

Diseases of the lung are a significant cause of
human morbidity and mortality. Increasingly, genetic
factors are being found that contribute to predisposition,
onset, and/or aggressiveness of most, if not all, of these
diseases. Although causative mutations in single genes have
been identified for some, these disorders are, for the most
part, believed to have polygenic etiologies.

For example, asthma affects [...] of the
population in the United States, making it the seventh-
ranking chronic condition. The worldwide prevalence of
asthma has increased more than 30% since the late
1970s, mostly in areas of increased industrialization. The
yearly economic costs (including both direct and indirect
costs) are estimated at almost $12 billion. Asthma
is also one of the most common reasons to seek medical
treatment, with over 1.5 million emergency room visits,
500,000 hospitalizations and over 5,500 deaths each year.
Outpatient visits are estimated at 15 million per year.

Patients with asthma suffer shortness of breath 
accompanied by cough, wheezing, and anxiety. Common 
features of acute asthma attacks include a rapid 
respiratory rate, tachycardia, and pulsus paradoxus. Acute 
attacks can be triggered by environmental factors such as 
allergens, changes in temperature, and exercise; other 
acute exacerbations have no discernible precipitating 
cause. If asthma is not treated, it can be
life-threatening.

It is now well known that genetic factors
predispose to asthma, but the exact nature of this genetic
component is still imprecise.

A 1986 human genetic study supported polygenic
inheritance, Townley et al., J. Allergy Clin. Immunol. 77:
101-107 (1986), and more recent studies have suggested that
predisposing factors for asthma, if not the disease itself,
are heritable. Slutsky, J. Clin. Pharmacol. 39: 246-51
(1999).

In one approach to elaborating the polygenic
contributions to asthma, candidate genes have been
suggested based upon presumed involvement in the
physiologic processes known to contribute to the
asthmatic state. Huss et al., Nurs. Clin. North Am. 35:
695-705 (2000).

In other studies, linkages and/or associations of
genetic markers with atopy, bronchial hyperresponsiveness
and/or asthma have been reported in candidate regions,
including the 6p region, which includes both the HLA
complex and the Tumor Necrosis Factor a gene (TNF-a), the
11q region, which includes the gene coding for the b
subunit of the high-affinity IgE receptor (FcεRI), the T-cell
receptor a gene on chromosome 14, the 5q region bearing
numerous candidate genes, among which are the interleukin
(IL 3, 4, 5, 9, 13) cluster and the b2-adrenergic receptor
gene, and the 12q region containing the genes for interferon-
gamma (IFNg), a mast cell growth factor (MSP), and an
insulin-like growth factor (IGF1). The strongest of
these linkages are associated with chromosomes 5 and 11.
Other linkage regions have been reported on chromosomes 6,
7, 11, 12 and 13. Demenais, The European Network For
Understanding Mechanisms of Severe Asthma, BIOMED 2
Program - European Commission (1998).

Linkage regions have also been suggested on chromosomes 3,
16 and 14. Duffy, D., "Review of Molecular Genetics of
Asthma and Allergy",
(http://www2.qimr.edu.au/davidD/asthma6.html).

As another example, chronic obstructive pulmonary
disease (COPD) is the fourth most common cause of death in
the United States. Although cigarette smoking is the most
common cause of COPD, with smokers having a rate 10 to 30
times higher for developing emphysema than non-smokers,
genetic factors are thought to play a significant role in
susceptibility to COPD; indeed, only 15-20% of long-term
cigarette smokers will develop COPD, suggesting that
genetic factors strongly affect outcome.

COPD includes both chronic bronchitis and
emphysema, which share similar symptoms and frequently
coexist. More than 16 million Americans have COPD, at a
cost currently estimated at $30 billion
each year. Chronic obstructive lung disease is
characterized by a decline in lung function resulting in
difficulty in breathing and physiological changes. In
severe COPD, patients breathe at very high lung volumes,
having lost the lung's normal elastic recoil. Because COPD
does not affect the lung uniformly, ventilation and
perfusion distribution is impaired. In areas of the lung
with low ventilation-perfusion ratios, arterial hypoxia
results. This can further lead to pulmonary hypertension,
right ventricular failure, and, ultimately, tissue
ischemia, such as coronary artery disease.

The only confirmed genetic risk factor for COPD
is the inherited deficiency of alpha 1-proteinase inhibitor
(familial emphysema). Familial emphysema accounts for less
than 5 percent of all cases of COPD, however, and familial
clustering of lung function and COPD suggests the presence
of other genetic risk factors. Luisetti et al., Monaldi
Arch. Chest Dis. 50:28-32 (1995); Khoury et al., Genet.
Epidemiol. 2: 155-66 (1985).

Among such additional genetic factors is the
presence of the GC2 allele, which appears to exert a
protective effect against COPD. Horne et al., Hum. Hered.
40: 173-76 (1990). Other suspected genetic involvement
includes genes coding for alpha1-antichymotrypsin,
alpha2-macroglobulin, vitamin D-binding protein and blood
group antigens. Sandford et al., Eur. Respir. J. 10: 1380-
91 (1997). Finally, the form of the enzyme microsomal
epoxide hydrolase is correlated to susceptibility to
COPD. Smith et al., The Lancet 350: 630-33 (1997). It
remains uncertain, however, whether other loci contribute
to predisposition and aggressiveness of COPD.

As yet a further example, lung cancer is the
leading cause of cancer death in both men and women in the
United States. Although smoking is the primary risk
factor, genetics plays a known role in susceptibility to
these bronchogenic carcinomas.

The most common of the bronchogenic carcinomas is
non-small cell lung cancer (NSCLC), which accounts for 75%
of all primary lung cancers. NSCLCs are divided into
adenocarcinomas, squamous cell carcinomas, and large cell
carcinomas. Small cell lung cancer (SCLC) comprises 20% of
primary lung cancers, and carcinoids make up 5%. Other
rare forms of lung cancer (all totaling less than 1%)
include lymphoma, carcinosarcoma, mucoepidermoid carcinoma,
malignant fibrous histiocytoma, melanoma, sarcoma
and blastoma. Lung cancer is generally not associated with
clinical symptoms until late in the course of the disease;
this late diagnosis is likely to contribute to the poor
5-year survival rate of 14%.

Premalignant changes are thought to include a
number of successive mutations in various growth regulation
genes. A chromosome 3p deletion, chromosome 9p deletion,
and p53 gene mutations have been identified in premalignant
lesions. Chromosomal abnormalities identified in both SCLC
and NSCLC include deletions involving chromosomes 3p, 5q,
9p, 11p, [...]. Weston et al., Proc. Natl. Acad.
Sci. USA 86: 5099-5103 (1989). For most of these regions,
suspected loci are tumor suppressor genes. Additionally,
transforming oncogenes such as Ki-ras, H-ras, N-ras, myc,
her2/neu, c-kit, bcl-2 and cyclin D1 (prad) have also been
shown to be activated in certain types of bronchogenic
carcinomas. Perucho et al., Cell 27: 467-76 (1981); Cecil
Textbook of Medicine, 21st ed. (2000).

Other contributing genetic loci have been
identified, including a deletion of the phosphatase and
tensin homolog (PTEN) at 10q23.3. Overexpression of PTEN
can inhibit invasion in lung cancer cells, and appears to
downregulate integrin alpha(6), laminin beta(3), heparin-
binding epidermal growth factor-like growth factor,
urokinase-type plasminogen activator, myb protein B, and
Akt2. Hong et al., Am. J. Respir. Cell Mol. Biol. 23: 355-
63 (2000). In a recent study assessing the risk of lung
cancer from environmental tobacco smoke (ETS), women who
were homozygous null for glutathione S-transferase M1
(GSTM1) had a statistically significant greater risk of
developing lung cancer from ETS. Bennett et al., J. Natl.
Cancer Inst. 91: 2009-2014 (1999). The identified genetic

[...] the interstitial lung diseases (ILDs) [...]
histopathologic features. ILDs comprise [...]
tolerance, and progress [...]. ILD is estimated to
account [...].

Genetic factors are known to contribute to the
development of some [...] with unknown etiology
[...], e.g., sarcoidosis, pulmonary [...]
histiocytosis, [...] pulmonary alveolar proteinosis, and
nonspecific [...].

[...] still undefined polygenic basis. The etiology
of sarcoidosis remains enigmatic, but has long [...]
to have a genetic component. Ethnic preponderance,
familial clustering and multigenerational involvement
all [...]. Rybicki et al., [...] 18: 707-717 (1997). Some
studies have shown an association between susceptibility to
sarcoidosis and HLA type. Nowack et al., Arch. Intern. Med.
147: 481-83 (1987); Ishihara et al., Tissue Antigens 50:
650-53 (1997).

[...] these diseases include, for example, Kartagener
[...], fibrocystic pulmonary dysplasia, primary ciliary
[...] disease.

The human genome-derived single exon nucleic acid
probes and microarrays of the present invention are useful
for predicting, diagnosing, grading, staging, monitoring
and [...] diseases of [...], particularly those
diseases with polygenic etiology. With each of the
single exon probes described herein shown to be expressed
at detectable levels in human lung, and with about 2/3 of
the probes identifying novel genes, the single exon
microarrays of the present invention provide exceptionally
high informational content for such studies.

[...] diagnosis (including differential
diagnosis among clinically indistinguishable disorders),
[...] can be based upon the quantitative relatedness of a
patient gene expression profile to one or more reference
profiles known to be characteristic of a given lung
disease, or of specific grades or stages thereof.

In one embodiment, the patient gene expression
profile is generated by hybridizing nucleic acids obtained
directly or indirectly from transcripts expressed in the
diseased lung to the genome-derived single exon microarray
of the present invention. Reference profiles are obtained
similarly by hybridizing nucleic acids from individuals
with [...]. Methods for quantitatively relating gene
expression profiles, without regard to the function of the
protein encoded by the gene, are disclosed in WO 99/58720,
incorporated herein by reference in its entirety.
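
The idea of relating a patient profile to reference
profiles can be illustrated with a short Python sketch.
The algorithms of WO 99/58720 are not reproduced in this
description, so the Pearson correlation used here is a
stand-in chosen purely for illustration; the function names
and the example profile names are likewise hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (>= 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denominator == 0:
        raise ValueError("a profile with no variation cannot be correlated")
    return numerator / denominator


def rank_references(patient, references):
    """Return (name, correlation) pairs, most similar reference first.

    `references` maps a reference label (for example a disease or a
    stage) to a profile measured on the same probes as `patient`.
    """
    scored = [(name, pearson(patient, ref)) for name, ref in references.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Note that the comparison uses only the measured values for
each probe; nothing about the function of the encoded
protein enters the calculation.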

In another approach, the genome-derived single
exon probes and microarrays of the present invention can
be used to interrogate genomic DNA, rather than pools of
expressed message; this latter approach permits
predisposition to and/or prognosis of lung disease to be
assessed through the massively parallel determination of
altered copy number, deletion, or mutation in the patient's
genome of exons known to be expressed in human lung. The
algorithms set forth in WO 99/58720 can be applied to such
genomic profiles without regard to the function of the
protein encoded by the interrogated gene.

The utility is specific to the probe; at
sufficiently high hybridization stringency, which
stringencies are well known in the art - see Ausubel et al.
and Maniatis et al. - each probe reports the level of
expression of message specifically containing that ORF.

It should be appreciated, however, that the
probes of the present invention, for which expression in
the lung has been demonstrated, are useful both for
measurement in the lung and for survey of expression in
other tissues.

Significant among such advantages is the presence
of probes for novel genes.

As mentioned above and further detailed in
Examples 1 and 2, the methods described enable ORFs which
are not present in existing expression databases to be
identified. And the fewer the number of tissues in which
the ORF can be shown to be expressed, the more likely the
ORF will prove to be part of a novel gene: as further
discussed in Example 2, ORFs whose expression was
measurable in only a single one of the tested tissues were
represented in existing expression databases at a rate of
only 11%, whereas 36% of ORFs whose expression was
measurable in 9 tissues were present in existing expression
databases, and fully 45% of those ORFs expressed in all ten
tested tissues were present in existing expressed sequence
databases.

Either as tools for measuring gene expression or
tools for surveying gene expression, the genome-derived
single exon probes of the present invention have
significant advantages over the cDNA or EST-based probes
that are currently available for achieving these utilities.

The genome-derived single exon probes of the
present invention are useful in constructing genome-derived
single exon microarrays; the genome-derived single exon
microarrays, in turn, are useful devices for measuring and
for surveying gene expression in the human.

Gene expression analysis using microarrays -
conventionally using microarrays having probes derived from
expressed message - is well-established as useful in the
biological research arts (see Lockhart et al., Nature 405:
827-836 (2000)).

Microarrays have been used to determine gene
expression profiles in cells in response to drug treatment
(see, for example, Kaminski et al., "Global Analysis of
Gene Expression in Pulmonary Fibrosis Reveals Distinct
Programs Regulating Lung Inflammation and Fibrosis," Proc.
Natl. Acad. Sci. USA 97(4):1778-83 (2000); Bartosiewicz et
al., "Development of a Toxicological Gene Array and
Quantitative Assessment of This Technology," Arch. Biochem.
Biophys. 376(1):66-73 (2000)), viral infection (see, for
example, Geiss et al., "Large-scale Monitoring of Host Cell
Gene Expression During HIV-1 Infection Using cDNA
Microarrays," Virology 266(1):8-16 (2000)) and during cell
processes such as differentiation, senescence and apoptosis
(see, for example, Shelton et al., "Microarray Analysis of
Replicative Senescence," Curr. Biol. 9(17):939-45 (1999);
Voehringer et al., "Gene Microarray Identification of Redox
and Mitochondrial Elements That Control Resistance or
Sensitivity to Apoptosis," Proc. Natl. Acad. Sci. USA
97(6):2680-5 (2000)).

Microarrays have also been used to determine
abnormal gene expression in diseased tissues (see, for
example, Alon et al., "Broad Patterns of Gene Expression
Revealed by Clustering Analysis of Tumor and Normal Colon
Tissues Probed by Oligonucleotide Arrays," Proc. Natl.
Acad. Sci. USA 96(12):6745-50 (1999); Perou et al.,
"Distinctive Gene Expression Patterns in Human Mammary
Epithelial Cells and Breast Cancers," Proc. Natl. Acad. Sci.
USA 96(16):9212-7 (1999); Wang et al., "Identification of
Genes Differentially Over-expressed in Lung Squamous Cell
Carcinoma Using Combination of cDNA Subtraction and
Microarray Analysis," Oncogene 19(12):1519-28 (2000);
Whitney et al., "Analysis of Gene Expression in Multiple
Sclerosis Lesions Using cDNA Microarrays," Ann. Neurol.
46(3):425-8 (1999)), in drug discovery screens (see, for
example, Scherf et al., "A Gene Expression Database for the
Molecular Pharmacology of Cancer," Nat. Genet. 24(2):236-44
(2000)) and in diagnosis to determine appropriate treatment
strategies (see, for example, Sgroi et al., "In vivo Gene
Expression Profile Analysis of Human Breast Cancer
Progression," Cancer Res. 59(22):5656-61 (1999)).

In microarray-based gene expression screens of
pharmacological drug candidates upon cells, each probe
provides specific useful data. In particular, it should be
appreciated that even those probes that show no change in
expression are as informative as those that do change,
serving, in essence, as negative controls.

For example, where gene expression analysis is
used to assess toxicity of chemical agents on cells, the
failure of the agent to change a gene's expression level is
evidence that the drug likely does not affect the pathway
of which the gene's expressed protein is a part.
Analogously, where gene expression analysis is used to
assess side effects of pharmacological agents - whether in
lead compound discovery or in subsequent screening of lead
compound derivatives - the inability of the agent to alter
a gene's expression level is evidence that the drug does
not affect the pathway of which the gene's expressed
protein is a part.

WO 99/58720 provides methods for quantifying the
relatedness of a first and second gene expression profile
and for ordering the relatedness of a plurality of gene
expression profiles. The methods so described permit
useful information to be extracted from a greater
percentage of the individual gene expression measurements
from a microarray than methods previously used in the art.

Other uses of microarrays are described in
Gerhold et al., Trends Biochem. Sci. 24(5):168-173 (1999)
and Zweiger, Trends Biotechnol. 17(11):429-436 (1999).

The invention particularly provides genome-
derived single-exon probes known to be expressed in lung.

The individual single exon probes can be provided
in the form of substantially isolated and purified nucleic
acid, typically, but not necessarily, in a quantity
sufficient to perform a hybridization reaction.

Such nucleic acid can be in any form [...]
hybridizable to the message that contains the probe's
ORF, such as double stranded DNA, single-stranded DNA
complementary to the message [...]. The probes can
additionally include either nonnative bases,
alternative internucleotide linkages, or both, so long as
complementary binding can be obtained. For example, probes
can include phosphorothioates, methylphosphonates,
morpholino analogs, and peptide nucleic acids (PNA), as
described, for example, in U.S. Patent Nos.
5,235,033; 5,166,315; 5,217,866; 5,184,444; 5,861,250.

Usefully, however, such probes are provided in a
form and quantity suitable for amplification, where the
amplified product is thereafter to be used in the
hybridization reactions that probe gene expression.
Typically, such probes are provided in a form and quantity
suitable for amplification by PCR or by other well known
amplification technique. One such technique additional to
PCR is rolling circle amplification, as is described, inter
alia, in U.S. Patent Nos. 5,854,033 and 5,714,320 and
international patent publications WO [...] and
WO 00/15779. As is well understood, where the probes are
to be provided in a form suitable for amplification, the
range of nucleic acid analogues and/or internucleotide
linkages will be constrained by the requirements and nature
of the amplification enzyme.

Where the probe is to be provided in a form
suitable for amplification, the quantity need not be
sufficient for direct hybridization for gene expression
analysis and need be sufficient only to function as an
amplification template, typically at least about 1, 10 or
100 pg or more.

Each discrete amplifiable probe can also be
packaged with amplification primers, either in a single
composition that comprises probe template and primers, or
in a kit that comprises such primers separately packaged
therefrom. As earlier mentioned, the ORF-specific
5' primers used for genomic amplification can have a first
common sequence added thereto, and the ORF-specific 3'
primers used for genomic amplification can have a second,
different, common sequence added thereto, thus permitting,
in this embodiment, the use of a single set of 5' and 3'
primers to amplify any one of the probes. The probe
composition and/or kit can also include buffers, enzyme,
etc., required to effect amplification.
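
The common-sequence scheme just described can be sketched
in Python as follows. The two tail sequences are arbitrary
placeholders invented for this example, not sequences taken
from this description, and the function names are likewise
hypothetical.

```python
# Placeholder common sequences (assumptions for illustration only).
FIVE_PRIME_TAIL = "ACGTACGTACGTACGTAC"
THREE_PRIME_TAIL = "TGCATGCATGCATGCATG"


def add_common_tails(orf_specific_5p, orf_specific_3p):
    """Prefix ORF-specific primers with the two common sequences.

    Both primers are written 5' to 3'. The common tail sits at the
    5' end of each primer, so every first-round amplicon ends up
    flanked by the same two sequences whatever ORF it contains.
    """
    return (
        FIVE_PRIME_TAIL + orf_specific_5p.upper(),
        THREE_PRIME_TAIL + orf_specific_3p.upper(),
    )


def universal_primers():
    """The single primer pair able to reamplify any tailed amplicon."""
    return FIVE_PRIME_TAIL, THREE_PRIME_TAIL
```

Because every tailed amplicon carries the same two flanking
sequences, the pair returned by `universal_primers()` is
sufficient to reamplify any probe in the collection.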

As mentioned earlier, when intended for use on a
genome-derived single exon microarray of the present
invention, the genome-derived single exon probes of the
present invention will typically average at least about
100, 200, 300, 400 or 500 bp in length, including (and
typically, but not necessarily, centered about) the ORF.

Furthermore, when intended for use on a genome-derived
single exon microarray of the present invention, the
genome-derived single exon probes of the present
invention will typically not contain a detectable label.

When intended for use in solution phase
hybridization, however - that is, for use in a
hybridization reaction [...] the target may [...] -
[...] relaxed [...]. The only functional constraint that
dictates the minimum size of such probe is that each such
probe must be capable of specifically identifying in a
hybridization reaction the exon [...].
[...] a probe of as little as 17 nucleotides is capable
of uniquely identifying its cognate sequence in the human
genome. For hybridization to expressed message - a subset
of target sequence that is much reduced in complexity as
compared to genomic sequence - even fewer nucleotides are
required.

The single exon probes of the present invention
can include as few as 20, 25 or 50 bp of ORF, or more. In
particular embodiments, the ORF sequences are given in SEQ
ID NOS. [...]. The minimum amount of ORF required to be
included in the probe of the present invention in order to
provide specific signal in [...]
microarray-based hybridizations can readily be determined
for each of ORF SEQ ID NOS. 12,615 [...] by
routine experimentation using standard high stringency
conditions. Such conditions are described,
inter alia, in Ausubel et al. and Maniatis et al. For
microarray-based hybridization, standard high stringency
conditions can usefully be 50% formamide, 5X SSC, 0.2 µg/µl
poly(dA), 0.2 µg/µl human cot1 DNA, and 0.5% SDS, in a
humid oven at 42°C overnight, followed by successive washes
of the microarray in 1X SSC, 0.2% SDS at 55°C for 5
minutes. For solution phase hybridization, standard high
stringency conditions can usefully be aqueous hybridization
at 65°C in 6X SSC. Lower stringency conditions, suitable
for cross-hybridization to mRNA encoding structurally- and
functionally-related proteins, can usefully be the same as
the high stringency conditions but with reduction in
temperature for hybridization and washing to room
temperature (approximately 25°C).
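
The figure of 17 nucleotides quoted earlier for unique
identification in the human genome can be checked with a
back-of-envelope calculation, sketched here in Python. The
genome size of roughly 3 x 10^9 bp and the simplifying
assumption of random, equiprobable sequence are assumptions
of this example, and the function name is hypothetical; the
sketch ignores repeats and base composition.

```python
def min_unique_length(genome_bp, both_strands=True):
    """Smallest k for which 4**k exceeds the number of k-mer positions.

    Under a random-sequence model, a k-mer is expected to occur
    less than once by chance when the number of possible k-mers
    (4**k) exceeds the number of positions at which one could occur.
    """
    positions = genome_bp * (2 if both_strands else 1)
    k = 1
    while 4 ** k <= positions:
        k += 1
    return k


# Assumed haploid human genome size of about 3 billion base pairs.
HUMAN_GENOME_BP = 3_000_000_000
print(min_unique_length(HUMAN_GENOME_BP))
```

Counting both strands, 4^16 is about 4.3 x 10^9, which is
smaller than the 6 x 10^9 possible positions, whereas 4^17
is about 1.7 x 10^10; a target of lower complexity, such as
a pool of expressed message, gives a smaller minimum.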

When intended for use in solution phase
hybridization, the maximum size of the single exon probes
of the present invention is dictated by the proximity of
other expressed exons in genomic DNA: although each single
exon probe can include intergenic and/or intronic material
contiguous to the ORF in the human genome, each probe of
the present invention will include portions of only one
expressed exon.

Thus, each single exon probe will include no more
than about 25 kb of contiguous genomic sequence, more
typically no more than about 20 kb of contiguous genomic
sequence, more usually no more than about 15 kb, even more
usually no more than about 10 kb. Usually, probes that are
maximally about 5 kb will be used, more typically no more
than about 3 kb.

It will be appreciated that the Sequence Listing
appended hereto presents, by convention, only that strand
of the probe and ORF sequence that can be directly
translated reading from 5' to 3' end. As would be well
understood by one of skill in the art, single stranded
probes must be complementary in sequence to the ORF as
present in an mRNA; it is well within the skill in the art
to determine such complementary sequence. It will further
be understood that double stranded probes can be used in
both solution-phase hybridization and microarray-based
hybridization if suitably denatured.
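
Determining the complementary strand mentioned above is a
mechanical operation, sketched here in Python. The function
name is an assumption of this example, and the sketch
handles only the four unambiguous DNA bases.

```python
# Translation table pairing each base with its Watson-Crick complement.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Return the complementary strand, written 5' to 3'.

    The listed (translatable) strand has the same sense as the
    mRNA, so a single stranded probe able to hybridize to the
    message is the reverse complement of the listed strand.
    """
    return sequence.translate(_COMPLEMENT)[::-1]
```

For example, `reverse_complement("ATGGCC")` returns
`"GGCCAT"`, and applying the function twice returns the
original sequence.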

Thus, it is an aspect of the present invention to
provide single-stranded nucleic acid probes that have
sequence complementary to those described herein above and
below, and double-stranded probes one strand of which has
sequence complementary to the probes described herein.

The probes can, but need not, contain intergenic
and/or intronic material that flanks the ORF, on one or
both sides, in the same linear relationship to the ORF that
the intergenic and/or intronic material bears to the ORF in
genomic DNA. The probes do not, however, contain nucleic
acid derived from more than one expressed ORF.

And when intended for use in solution
hybridization, the probes of the present invention can
usefully have detectable labels. Nucleic acid labels are
well known in the art, and include, inter alia, radioactive
labels, such as 3H, 32P, 33P, 35S, 125I, 131I; fluorescent
labels, such as Cy3, Cy5, Cy5.5, Cy7, SYBR®
Green and other labels described in Haugland,
Handbook of Fluorescent Probes and Research Chemicals, 7th
ed., Molecular Probes Inc., Eugene, OR (2000), or
fluorescence resonance energy transfer tandem conjugates
thereof; labels suitable for chemiluminescent and/or
enhanced chemiluminescent detection; labels suitable for
ESR and NMR detection; and labels that include one member
of a specific binding pair, such as biotin, digoxigenin, or
the like.

The probes, either in quantity sufficient for
hybridization or sufficient for amplification, can be
provided in individual vials or containers.
Alternatively, such probes can usefully be
[...] as a plurality [...]
single exon probes.

When provided as a collection of plural
individual probes, the probes are typically made available
in amplifiable form in a spatially-addressable ordered set,
typically one per well of a microtiter dish. Although a 96
well microtiter plate can be used, greater efficiency is
obtained using higher density arrays.
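
A spatially-addressable ordered set of the kind just
described amounts to a fixed mapping between a probe's
position in the collection and a well. The following Python
sketch shows one such mapping; the row-major ordering and
the function name are assumptions of this example, not a
layout specified in this description.

```python
import string


def well_id(index, rows=8, cols=12):
    """Map a zero-based probe index to a well label such as 'A1'.

    The defaults describe a 96 well plate (8 rows by 12 columns),
    filled row by row. A 384 well plate is rows=16, cols=24.
    """
    if not 0 <= index < rows * cols:
        raise IndexError("probe index does not fit on this plate")
    row, col = divmod(index, cols)
    return f"{string.ascii_uppercase[row]}{col + 1}"
```

With the defaults, index 0 maps to well A1 and index 95 to
well H12; passing `rows=16, cols=24` addresses a higher
density 384 well plate in the same way.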

If, as earlier mentioned, the ORF-specific
5' primers used for genomic amplification had a first
common sequence added thereto, and the ORF-specific 3'
primers used for genomic amplification had a second,
different, common sequence added thereto, a single set of
5' and 3' primers can be used to amplify all of the probes
[...].

[...] genome-derived single exon
probes can usefully include [...]
for the common attribute of expression [...].

In such defined subsets, typically at least 50,
60, 75, 80, 85, 90 or 95%, or more of the probes will be
chosen by their expression in the defined tissue or cell
type.

The single exon probes of the present invention,
as well as fragments of the single exon probes comprising
selectively hybridizable portions of the probe ORF, can be
used to obtain the full length cDNA that includes the
ORF by (i) screening of cDNA libraries; (ii) rapid
amplification of cDNA ends ("RACE"); or (iii) other
conventional means, as are described, inter alia, in
[...].

It is another aspect of the present invention to
provide genome-derived single exon nucleic acid microarrays
useful for gene expression analysis, where the term
"microarray" has the meaning given in the definitional
section of this description, supra.

The invention particularly provides genome-
derived single-exon nucleic acid microarrays comprising a
plurality of probes known to be expressed in human lung.
In preferred embodiments, [...] microarrays comprising a
[...] SEQ ID NOS.: [...].

[...] genome-derived single exon microarrays [...]
physical informational [...] single exon
probes known to be expressed in the tested tissue.
At a fixed probe density, for example, a given microarray
surface area of the defined subset genome-derived single
exon microarray [...] expression
measurements. [...] at a given probe density, the
same number of expression [...] can be obtained from
a smaller substrate surface area. Alternatively, at a
fixed probe density and fixed [...], [...]
provided redundantly, [...]
signal measurement for any given probe. [...]
a higher percentage of probes [...] in the
[...], the dynamic range of the detection means
[...] adjusted to reveal finer levels of discrimination
among [...].

Although [...] described with respect to
their utility as probes of gene expression, [...]
probes to be included on a genome-derived single exon
microarray, each of the [...]
12,614 contains an open reading frame, [...]
respectively in SEQ ID NOS.: 12,615 - 25,001, [...]
a protein domain. Thus, each of SEQ ID NOS. [...] can
be used, or that portion thereof in SEQ ID NOS. [...]
25,001 used, to express [...]
in vitro recombinant techniques. See Ausubel et al. and
Maniatis et al. Additionally, kits are available commercially
that readily permit such nucleic acids to be expressed as
protein in bacterial cells, insect cells, or mammalian
cells [...] Protein Expression &
Purification System, ClonTech Laboratories, Palo Alto, CA;
[...] Purification System, New [...].

Peptides can be chemically
synthesized using commercial peptide synthesizing equipment
and well known techniques. Procedures are described, inter
alia, in Chan et al. (eds.), Fmoc Solid Phase Peptide
Synthesis: A Practical Approach (Practical Approach Series,
(Paper)), Oxford Univ. [...]; Jones, [...] Synthesis [...];
[...] (Springer Laboratory), Springer Verlag
(December 1993) (ISBN: 0387564314).

It is [...] aspect of the invention
to provide peptides comprising an amino acid sequence
translated from SEQ ID NOS.: 12,615 - 25,001. Such amino
acid sequences are set [...] 25,002 [...].
[...] any such recombinant [...] or synthesized peptide of
at least 8, and preferably [...],
can be conjugated to a carrier protein and used to generate
antibody that recognizes the peptide. Thus, it is a
further aspect of the invention to provide [...]
have at least 8, preferably at least [...]
acids.

The following examples are offered by way of
illustration and not by way of limitation.

EXAMPLE 1

Preparation of Single Exon Microarrays from ORFs Predicted
in Human Genomic Sequence

Bioinformatics Results
All human BAC sequences in fewer [...]
that had been accessioned in a five month period
immediately preceding this study were [...]
GenBank. This corresponds to ~2200 clones, [...]
of sequence, or approximately 10% of the human genome.

After masking repetitive elements using the
program CROSS_MATCH, the sequence was analyzed for open
reading frames using three separate gene finding programs.
The three programs predict genes using independent
algorithmic methods developed on independent training sets:
GRAIL uses a neural network, GENEFINDER uses a hidden
Markov model, and DICTION, a program proprietary to
Genetics Institute, operates according to a different
heuristic. The results of all three programs were used to
create a prediction matrix across the segment of genomic
DNA.
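The prediction matrix described above can be pictured with a minimal sketch. The source does not give the matrix layout, so the per-base representation, the half-open interval convention, the function names, and the toy coordinates below are all assumptions made for illustration.

```python
def prediction_matrix(length, predictions):
    """One row per program; 1 marks a base that program calls coding.

    `predictions` maps a program name to half-open (start, end) intervals.
    The layout is an illustrative assumption, not the layout used in the source.
    """
    matrix = {}
    for program, intervals in predictions.items():
        row = [0] * length
        for start, end in intervals:
            for pos in range(max(0, start), min(length, end)):
                row[pos] = 1
        matrix[program] = row
    return matrix


def consensus_intervals(matrix, min_votes=2):
    """Return half-open intervals called coding by at least `min_votes` programs."""
    rows = list(matrix.values())
    length = len(rows[0]) if rows else 0
    votes = [sum(column) for column in zip(*rows)]
    intervals, start = [], None
    for pos, count in enumerate(votes):
        if count >= min_votes and start is None:
            start = pos
        elif count < min_votes and start is not None:
            intervals.append((start, pos))
            start = None
    if start is not None:
        intervals.append((start, length))
    return intervals


# Hypothetical predictions on a 200-base segment.
predictions = {
    "GRAIL": [(10, 60), (100, 150)],
    "GENEFINDER": [(20, 70)],
    "DICTION": [(120, 140), (180, 190)],
}
matrix = prediction_matrix(200, predictions)
consensus = consensus_intervals(matrix)  # regions called by any two programs
```

With `min_votes=2` the sketch keeps regions called by any two programs; raising it to 3 keeps only regions on which all three programs agree.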



The three gene finding programs yielded a range
of results. GRAIL identified the greatest percentage of
genomic sequence as putative coding region, 2% of the data
analyzed. GENEFINDER was second, calling 1%, and DICTION
yielded the least putative coding region, with 0.8% of
genomic sequence called as coding region.

The consensus data were as follows. GRAIL and
GENEFINDER agreed on 0.7% of genomic sequence, GRAIL and
DICTION agreed on 0.5% of genomic sequence, and the three
programs together agreed on 0.25% of the data analyzed.
That is, 0.25% of genomic sequence was identified by
all three of the programs as containing putative coding
region.

ORFs predicted by any two of the three programs






PCT/US01/00665 

("consensus ORFs") were assorted into "gene bins" using two
criteria: (1) any 7 consecutive exons within a 25 kb window
were placed together in a bin as likely contributing to a
single gene, and (2) all ORFs within a 25 kb window were
placed together in a bin as likely contributing to a single
gene if fewer than 7 exons were found within the 25 kb
window.
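The two binning criteria can be sketched as a greedy left-to-right scan. This is one possible reading of the criteria, not the procedure actually used: the function names, the (start, end) tuple representation, and the choice to anchor each window at the first ORF of a bin are assumptions.

```python
def bin_orfs(orfs, window=25_000, max_exons=7):
    """Group consensus ORFs, given as (start, end) tuples, into gene bins.

    A bin is closed when the next ORF starts 25 kb or more beyond the first
    ORF in the bin, or when the bin already holds 7 exons.
    """
    bins, current = [], []
    for orf in sorted(orfs):
        bin_is_closed = bool(current) and (
            orf[0] - current[0][0] >= window or len(current) == max_exons
        )
        if bin_is_closed:
            bins.append(current)
            current = []
        current.append(orf)
    if current:
        bins.append(current)
    return bins


def largest_orf(gene_bin):
    """Pick the longest ORF in a bin as the amplification candidate."""
    return max(gene_bin, key=lambda orf: orf[1] - orf[0])


# Hypothetical consensus ORF coordinates on one BAC.
orfs = [(1000, 1200), (3000, 3650), (9000, 9100), (40000, 40300), (41000, 41150)]
gene_bins = bin_orfs(orfs)
candidates = [largest_orf(b) for b in gene_bins]
```

Here the first three ORFs fall within one 25 kb window and form one bin, while the last two form a second bin; the longest ORF of each bin is the amplification candidate.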

The largest ORF from each gene bin that did not
span repetitive sequence was then chosen for amplification,
as were all consensus ORFs longer than 500 bp. This method
approximates one exon per gene; however, a number of genes
were found to be represented by multiple elements.

Previously, we had determined that fragments
[...] glass surface of the slides used as support
substrate for construction of microarrays [...]
amplicons were designed in the present experiments to
approximate 500 bp in length.

Accordingly, after selecting the largest ORF per
gene bin, a 500 bp fragment of sequence centered on the ORF
was passed to the primer picking software, PRIMER3
(available online for use at
http://www-genome.wi.mit.edu/cgi[...]). A first
additional sequence was commonly added to each ORF-unique
5' primer, and a second, different, additional sequence was
commonly added to each ORF-unique 3' primer, to permit
subsequent reamplification of the amplicon using a single
set of "universal" 5' and 3' primers, thus immortalizing
the amplicon. The addition of universal priming sequences
also facilitates sequence verification, and can be used to
add a cloning site should some ORFs be found to warrant
further study.

[...] sequenced using the [...]
to validate the identity of the amplicon
to be spotted in the microarray.
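The selection of a 500 bp fragment centered on the ORF can be illustrated as simple coordinate arithmetic. The sketch below is an assumption about how such a window might be computed; the function name, the 0-based half-open coordinates, and the clamping at clone boundaries are hypothetical and are not taken from the source or from PRIMER3.

```python
def centered_window(orf_start, orf_end, size=500, seq_len=None):
    """Return a half-open (start, end) window of `size` bp centered on an ORF.

    The window is shifted, not shrunk, when it would run off either end of
    the clone sequence (a clamping rule assumed here for illustration).
    """
    center = (orf_start + orf_end) // 2
    start = center - size // 2
    end = start + size
    if start < 0:
        start, end = 0, size
    if seq_len is not None and end > seq_len:
        start, end = max(0, seq_len - size), seq_len
    return start, end


# A hypothetical 150 bp exon at positions 1000-1150 of a clone.
window = centered_window(1000, 1150)
coding_fraction = (1150 - 1000) / (window[1] - window[0])
```

For a 150 bp exon, the median predicted exon size reported in this example, the 500 bp window is 30% coding, which shows why a sizeable share of each amplicon is flanking noncoding sequence.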

Primers were supplied by Operon Technologies
(Alameda, CA). PCR amplification was performed by standard
techniques using human genomic DNA (Clontech, Palo Alto,
CA) as template. Each PCR product was verified by SYBR
green (Molecular Probes, Inc., Eugene, OR) staining of
agarose gels, with subsequent imaging by FluorImager
(Molecular Dynamics, Inc., Sunnyvale, CA). PCR
amplification was classified as successful if a single
band appeared.

The success rate for amplifying ORFs of interest
directly from genomic DNA using PCR was approximately 75%.

FIG. 5 graphs the distribution of predicted ORF (exon)
length and distribution of amplified PCR products, with ORF
length shown in red and PCR product length shown in blue
(which may appear black in the figure). Although the range
of ORF sizes is readily seen to extend to beyond 900 bp,
the mean predicted exon size was only 229 bp, with a median
size of 150 bp (n = [...]). With an average amplicon size of
475 ± 25 bp, approximately 50% of the average PCR
amplification product contained predicted coding region,
with the remaining 50% of the amplicon containing either
[...]

[...] 500 bp, it was found that long exons had a higher PCR
failure rate. To address this, the bioinformatics process
was adjusted to amplify 1000, 1500 or 2000 bp fragments
from exons larger than 500 bp. This improved the rate of
successful amplification of exons exceeding 500 bp,
constituting about [...].2% of the exons predicted by the gene
finding algorithms.

[...] probes disposed on the
array (90% of those that successfully PCR amplified) were
[...] in both the forward and [...]
[...] yielded very poor [...]
results. The reasons for this are unclear,
[...] inclusion of vector and host contaminants in some
submitted sequences [...]
[...] coding regions could theoretically interfere with
[...] ratios were not significantly affected by the presence
of noncoding sequence. The variation in exon size was
similarly found not to affect differential expression
ratios significantly; however, variation in exon size was
observed to affect the absolute signal intensity (data not
shown).

The 350 MB of genomic DNA was, by the above-
described process, reduced to 3,30 discrete probes, which
were spotted in duplicate onto glass slides using
commercially available instrumentation (MicroArray GenII
Spotter and/or MicroArray GenIII Spotter, Molecular
Dynamics, Inc., Sunnyvale, CA). Each slide additionally
included either 16 or 32 E. coli genes, the average
hybridization signal of which was used as a measure of
background biological noise.

Each of the probe sequences was BLASTed against
GenBank (May 7, 1999 release 2.0.9).
One third of the probe sequences (as amplified)
produced an exact match (BLAST Expect ("E") values less
than 1e-100) to either an EST (20% of sequences) or a known
mRNA (13% of sequences). A further 22% of the probe
sequences showed some homology to a known EST or mRNA
(BLAST E values from 1e-5 to 1e-99). The remaining 45% of
the probe sequences showed no significant sequence homology
to any expressed, or potentially expressed, sequences
present in public databases.

All of the probe sequences (as amplified) were
then analyzed for protein similarities with the SwissProt
database using BLASTX, Gish et al., Nature Genet. 3:266
(1993). The predicted functional breakdowns of the 2/3 of
probes identical or homologous to known sequences are
presented in Table 1.



Table 1

Function of Predicted ORFs As Deduced From Comparative
Sequence Analysis

As can be seen, the two most common types of
genes were transcription factors and receptors, making up
2.2% and 1.8% of the arrayed elements, respectively.



EXAMPLE 2

Gene Expression Measurements From Genome-Derived Single
Exon Microarrays

The two genome-derived single exon microarrays
prepared according to Example 1 were hybridized in a series
of simultaneous two-color fluorescence experiments to (1)
Cy3-labeled cDNA synthesized from message drawn
individually from each of brain, heart, liver, fetal liver,
placenta, lung, bone marrow, HeLa, BT 474, or HBL 100
cells, and (2) Cy5-labeled cDNA prepared from message
pooled from all ten tissues and cell types, as a control in
each of the measurements. Hybridization and scanning were
carried out using standard protocols and Molecular Dynamics
equipment.

Briefly, mRNA samples were bought from commercial
sources (Clontech, Palo Alto, CA and Amersham Pharmacia
Biotech (APB)). Cy3-dCTP and Cy5-dCTP (both from APB) were
incorporated during separate reverse transcriptions of 1 µg
of polyA+ mRNA performed using 1 µg oligo(dT)12-18 primer
and 2 µg random 9mer primers as follows. After heating to
70°C, the RNA:primer mixture was snap cooled on ice. After
snap cooling on ice, the following were added to the RNA to
the stated final concentration: 1X Superscript II buffer,
0.01 M DTT, 100 µM dATP, 100 µM dGTP, 100 µM dTTP, 50 µM
dCTP, 50 µM Cy3-dCTP or Cy5-dCTP, and 200 U Superscript II
enzyme. The reaction was incubated for 2 hours at 42°C.
After 2 hours, the first strand cDNA was isolated by adding
1 U Ribonuclease H, and incubating for 30 minutes at 37°C.
The reaction was then purified using a Qiagen PCR cleanup
column, increasing the number of ethanol washes to 5.
Probe was eluted using 10 mM Tris pH 8.5.

Using a spectrophotometer, probes were measured
for dye incorporation. Volumes of both Cy3 and Cy5 cDNA
corresponding to 50 pmoles of each dye were then dried in a
Speedvac, resuspended in 30 µl hybridization solution
containing 50% formamide, 5X SSC, 0.2 µg/µl poly(dA), 0.2
µg/µl human Cot-1 DNA, and 0.5% SDS.

Hybridizations were carried out under a
coverslip, with the array placed in a humid oven at [...]°C
overnight. Before scanning, slides were washed [...]
at 55°C for 5 minutes, followed by 0.1X SSC, 0.2%
SDS at 55°C for [...]0 minutes. Slides were briefly dipped in
water and dried thoroughly under a gentle stream of
nitrogen.

Slides were scanned using a Molecular Dynamics
Gen3 scanner, as described. Schena (ed.), Microarray
Biochip Technology, Eaton Publishing
Company/BioTechniques Books Division [...]

Although the use of pooled cDNA as a reference
permitted the survey of a large number of tissues, it
attenuates the measurement of relative gene expression,
since every highly expressed gene in the tissue/cell
type-specific fluorescence channel will be present to a
level of at least 10% in the control channel. [...]
both signal and expression ratios (the [...]
at least three times greater than biological
[...]

The [...] expression signal for these probes
was then plotted as function of tissue or cell type, and is
presented in FIG. 6.

FIG. 6 shows the distribution of expression
across a panel of ten tissues. The graph shows the number
of sequence-verified products that were either not
expressed ("0"), expressed in one or more but not all
tested tissues ("1" - "9"), and expressed in all tissues
tested ("10").

Of 9999 arrayed elements on the two microarrays
(including positive and negative controls and "failed"
products), 2353 (51%) were expressed in at least one tissue
or cell type. Of the gene elements showing significant
signal - where expression was scored as "significant" if
the normalized Cy3 signal was greater than 1, representing
signal 5-fold over biological noise (0.2) - 39% (991) were
expressed in all 10 tissues. The next most common class
(15%) consisted of gene elements expressed in only a single
tissue.

The genes expressed in a single tissue were
further analyzed, and the results of the analyses are
compiled in FIG. 7.

FIG. 7A is a matrix presenting the expression of
all verified sequences that showed expression greater than
3 in at least one tissue. Each clone is represented by a
column in the matrix. Each of the 10 tissues assayed is
represented by a separate row in the matrix, and relative
expression of a clone in that tissue is indicated at the
respective node by intensity of green shading, with the
intensity legend shown in panel B. The top row of the
matrix ("EST Hit") contains "bioinformatic" rather than
"physical" expression data - that is, presents the results
returned by query of EST, NR and SwissProt databases using
the probe sequence. The legend for "bioinformatic
expression" (i.e., degree of homology returned) is
presented in panel C. Briefly, white is known, black is
novel, with gray depicting nonidentical with significant
homology (white: E values < 1e-100; gray: E values from
1e-05 to 1e-99; black: E values > 1e-05).
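The three-way shading rule for the "EST Hit" row can be written as a small classifier. This is a sketch only: the function name is hypothetical, the treatment of a probe with no hit as "black" is an assumption, and values falling between 1e-100 and 1e-99, which the stated ranges leave unassigned, are placed in the gray class here.

```python
def homology_class(e_value):
    """Map a top-hit BLAST E value to the shading classes described for FIG. 7.

    white: known (E < 1e-100)
    gray:  nonidentical with significant homology (1e-100 <= E <= 1e-05)
    black: novel (E > 1e-05, or no hit at all, which is an assumption)
    """
    if e_value is None:
        return "black"
    if e_value < 1e-100:
        return "white"
    if e_value <= 1e-5:
        return "gray"
    return "black"


# Hypothetical top-hit E values for four probes.
examples = {"probe_1": 0.0, "probe_2": 1e-50, "probe_3": 0.5, "probe_4": None}
shading = {name: homology_class(e) for name, e in examples.items()}
```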

As FIG. 7 readily shows, heart and brain were
demonstrated to have the greatest numbers of genes that
were shown to be uniquely expressed in the respective
tissue. In brain, 200 uniquely expressed genes were
identified; in heart, 150. The remaining tissues gave the
following figures for uniquely expressed genes: liver, 100;
lung, 70; fetal liver, 150; bone marrow, 75; placenta, 100;
HeLa, 50; HBL, 100; and BT474, 50.

It was further observed that there were many more
"novel" genes among those that were up-regulated in only
one tissue, as compared with those that were down-regulated
in only one tissue. In fact, it was found that ORFs whose
expression was measurable in only a single of the tested
tissues were represented in sequencing databases at a rate
of only 11%, whereas 36% of the ORFs whose expression was
measurable in 9 of the tissues were present in public
databases. As for those ORFs expressed in all ten tissues,
fully 45% were present in existing expressed sequence
databases. These results are not unexpected, since genes
expressed in a greater number of tissues have a higher
likelihood of being, and thus of having been, discovered by
EST approaches.

Comparison of Signal from Known and Unknown Genes

The normalized signal of the genes found to have
high homology to genes present in the GenBank human EST
database was compared to the normalized signal of those
genes not found in the GenBank human EST database. The
data are shown in FIG. 8.

FIG. 8 shows the normalized Cy3 signal intensity
for all sequence-verified products with a BLAST Expect
("E") value of greater than 1e-30 (designated "unknown")
upon query of existing EST, NR and SwissProt databases, and
shows in blue the normalized Cy3 signal intensity for all
sequence-verified products with a BLAST Expect value of
less than 1e-30 ("known"). Note that biological background
noise has an averaged normalized Cy3 signal intensity of
0.2.

As expected, the most highly expressed of the
ORFs were "known" genes. This is not surprising, since
very high signal intensity correlates with very commonly-
expressed genes, which have a higher likelihood of being
found by EST sequencing.

However, a significant point is that a large
number of even the high expressers were "unknown". Since
the genomic approach used to identify genes and to confirm
their expression does not bias exons toward either the 3'
or 5' end of a gene, many of these high expression genes
will not have been detected in an end-sequenced cDNA
library.

The significant point is that presence of the
gene in an EST database is not a prerequisite for
incorporation into a genome-derived microarray, and
further, that arraying such "unknown" exons can help to
assign function to as-yet undiscovered genes.

Verification of Gene Expression

To ascertain the validity of the approach
described above to identify genes from raw genomic
sequence, expression of two of the probes was assayed using
reverse transcriptase polymerase chain reaction (RT-PCR)
and northern blot analysis.

Two microarray probes were selected on the basis
of exon size, prior sequencing success, and tissue-specific
gene expression patterns as measured by the microarray
experiments. The primers originally used to amplify the
two respective ORFs from genomic DNA were used in RT-PCR
against a panel of tissue-specific cDNAs (Rapid-Scan gene
expression panel [...] human cDNAs) (OriGene Technologies,
Inc., Rockville, MD).

Sequence M,079300_l was shown by microarray
hybridization to be present in cardiac tissue, and sequence
^03X734 X was shown by microarray experiment to be present
in placental tissue (data not shown). RT-PCR on these two
sequences confirmed the tissue-specific gene expression as
measured by microarrays, as ascertained by the presence of
a correctly sized PCR product from the respective tissue
type cDNAs.

Clearly, all microarray results cannot, and
indeed should not, be confirmed by independent
methods, or the high throughput, highly parallel advantages
of microarray hybridization assays will be lost. However,
in addition to the two RT-PCR results presented above, the
observation that 1/3 of the arrayed genes exist in
expression databases provides powerful confirmation of the
power of our methodology - which combines bioinformatic
prediction with expression confirmation using genome-
derived single exon microarrays - to identify novel genes
from raw genomic data.

To verify that the approach further provides
correct characterization of the expression patterns of the
identified genes, a detailed analysis was performed of the
microarrayed sequences that showed high signal in brain.

For this latter analysis, sequences that showed
high (normalized) signal in brain, but which showed very
low (normalized) signal (less than [...], determined to be
biological noise) in all other tissues, were further
studied. There were 8[...] sequences that fit these criteria,
approximately [...] of the arrayed elements. The 10 sequences
showing the highest signal in brain in microarray
hybridizations are detailed in Table 2, [...] assigned
function, if known or reasonably predicted.

Table 2

Function of the Most Highly
Expressed Genes Expressed Only in Brain

Microarray Sequence   Normalized Signal   Expression Ratio   Homology to EST Present in GenBank



AP000047-1 



AC006548-9 



AC007245-5 



L44140-4 



2.3 



1.7 



1.5 



1.2 



+7.7 



High 



High 
High 



High 



+2.0 



High 



Gene Function as described by GenBank

S-100 protein, b-chain, Ca2+ binding protein expressed in
central nervous system

Unknown function

Similar to mouse membrane glycoprotein M6, expressed in
central nervous system

Similar to amphiphysin, a synaptic vesicle-associated
protein. Ref 21

Endothelial actin-binding protein found in nonmuscle [...]

AC004689-9

AL031657-1

[...]
Protein phosphatase PP2A, neuronal/downregulates activated
protein kinases

[...]   +3.0   High
Unknown function/Contains the ankyrin motif, a common
protein sequence motif

[...]   [...]   Low
Low homology to the Synaptotagmin I protein in rat/present
at low levels throughout rat brain

1.0   +2.7   Low
Unknown, very poor homology to collagen

1.0   [...]   High
Protein phosphatase PP2A, neuronal/downregulates activated
protein kinases



Of the ten sequences studied by these latter
confirmatory approaches, eight were previously known. Of
these eight, six had previously been reported to be
present in the central nervous system or brain. The exon
giving the highest signal (AP00317-1) was found to be
encoding an S100B Ca2+ binding protein, reported in
the literature to be highly and uniquely expressed in the
central nervous system. Heizmann, Neurochem. Res. [...](9):1097
[...]
A number of the brain-specific probe sequences
(including AC00S548-9, AC0092S6-2) did not have
any known human cDNAs in GenBank but did show homology to
[...] mouse cDNAs. Sequences AC004689-9 and AC004689-3
were both found to be phosphatases present in neurons
(Millward et al., Trends Biochem. Sci. 24(5):186-191
(1999)). Two microarray sequences, AP000047-1 and
AP000086-1, have unknown function, AP000086-1 being
absent from GenBank. Functionality can now be narrowed
down to a role in the central nervous system for both of
these genes, showing the power of designing microarrays in
this fashion.

Next, the function of the chip sequences with the
highest (normalized) signal intensity in brain, regardless
of expression in other tissues, was assessed. In this
latter analysis, we found expression of many more common
genes, since the sequences were not limited to those
expressed only in brain. For example, looking at the [...]
highest signal intensity spots in brain, 4 were similar to
tubulin (AC00807905; AF346191-2; AC007664-4; AF14191-2), 2
were similar to actin (AL035701-2; AL034402-1), and 6 were
found to be homologous to glyceraldehyde-3-phosphate
dehydrogenase (GAPDH) (AL035604-1; Z86090-1; AC006064-L;
ACoLo 6 4-K; AC035604-3; AC0060S4-L). These genes are often
used as controls or housekeeping genes in microarray
experiments of all types.

Other interesting genes highly expressed in brain
were a ferritin heavy chain protein, which is reported in
the literature to be found in brain and liver (Joshi et
[...] [...]plicated with the array. Other [...]
sequences included a translation elongation factor 1α
(AC007564-4), a DEAD-box homolog (AL023804-4), and a
[...] chromosome RNA-binding motif (Chai et al., Genomics
49(2):283-89 (1998)) (AC007320-3). A low homology analog
(AP00123-1/2) to a gene, DSCR1, thought to be involved in
trisomy 21 (Down's syndrome), showed high expression in
brain and heart, in agreement with the literature
(Fuentes et al., Hum. Mol. Genet. 4(10):1935-44 (1995)).

As a further validation of the approach, we
selected the BAC AC006064 to be included on the array.
This BAC was known to contain the GAPDH gene, and thus
could be used as a control for the ORF selection process.
The gene finding and exon selection algorithms resulted in
[...] 25 exons from BAC AC006064 for spotting onto the
array, of which four were drawn from the GAPDH gene. Table
3 shows the comparison of the average expression ratio for
the 4 exons from BAC AC006064 compared with the average
expression ratio for 5 different dilutions of a
commercially available GAPDH cDNA (Clontech).

Table 3

[...] Expression Ratio, for each
tissue, of GAPDH

            AC006064 (n = 4)    Control (n = 5)
Liver       -1.62 ± 0.22        -2.07 ± [...]
Lung        -4.95 ± 0.93        -3.75 ± 0.21
Placenta    -3.56 ± 0.25        -3.52 ± 0.43



Each tissue shows excellent agreement between the
experimentally chosen exons and the control, again
demonstrating the validity of the present exon mining
approach. In addition, the data also show the variability
of expression of GAPDH within tissues, calling into
question its classification as a housekeeping gene and
utility as a housekeeping control in microarray
experiments.

EXAMPLE 3 

Representation of Sequence and Expression Data as a 
"Mondrian" 

For each genomic clone processed for microarray
as above-described, a plethora of information was
accumulated, including full clone sequence, probe sequence
within the clone, results of each of the three gene finding
programs, EST information associated with the probe
sequences, and microarray signal and expression for
multiple tissues, challenging our ability to display the
information.

Accordingly, we devised a new tool for visual
display of the sequence with its attendant annotation
which, in deference to its visual similarity to the
paintings of Piet Mondrian, is hereinafter termed a
"Mondrian". FIGS. 3 and 4 present the key to the
information presented on a Mondrian.

FIG. 9 presents a Mondrian of BAC AC008172 (bases
25,000 to 130,000 shown), containing the carbamyl phosphate
synthetase gene (AF154830.1). Purple background within the
[...] shown as field [...] in FIG. 3 indicates all 37 known
exons for [...]. As can be seen, GRAIL II successfully
identified 27 of the known exons (73%), GENEFINDER
successfully identified 37 of the known exons (100%), while
DICTION identified 7 of the known exons (19%).

Seven of the predicted exons were selected for
physical assay, of which 5 successfully amplified by PCR
and were sequenced. These five exons were all found to be
from the same gene, the carbamyl phosphate synthetase gene
(AF154830.1).

The five exons were arrayed, and gene expression
measured across 10 tissues. As is readily seen in the
Mondrian, the five chip sequences on the array show
identical expression patterns, elegantly demonstrating the
reproducibility of the system.

FIG. 10 is a Mondrian of BAC AL049839. We
selected 12 exons from this BAC, of which 10 successfully
sequenced, which were found to form between 5 [...]
Interestingly, 4 of the genes on this BAC are protease
inhibitors. Again, these data elegantly show that exons
selected from the same gene show the same expression
patterns, depicted below the red line. From this figure,
it is clear that our ability to find known genes is very
[...]. A novel gene is also found from 86 [...] kb,
upon which all the exon finding programs agree. We are
confident we have two exons from a single gene since they
show the same expression patterns and the exons are
proximal to each other. Backgrounds in the following
colors indicate a known gene (top to bottom):
red - kallistatin protease inhibitor (P29622);
purple - plasma serine protease inhibitor (P05154);
turquoise - α1 anti-chymotrypsin (P01011); mauve - 40S
ribosomal protein (P08865). Note that chip sequences 8
and 12 did not sequence verify.

EXAMPLE 4

Genome-Derived Single Exon Probes Useful For Measuring
Human Gene Expression

The protocols set forth in Examples 1 and 2,
supra, were applied to additional human genomic sequence as
it became newly available in GenBank to identify unique
exons in the human genome that could be shown to be

— Z^J^^~ — h 

str ands P-^^'J^ ± J the exact chemical 
15 microarray; **^J ■ added benefit of sequencing is 
rTp^ possession of a set of single hase- 
iVmented fragments of *, sequenced nucleic acid 
. from the sequencing primer 3' OH. (Since c 

- ::::: rri/— - 

r 2 ,61* sinqie exon prohes. each 

- - - -trssrr- ».« — - o 

probes are clearly presented in the Sequence Listing [...]
12,614. The 16 nt 5' primer sequence and 16 nt
[...] sequence [...] on the amplicon are not included
in the Sequence Listing. The sequences of [...]
present within each of these probes is presented in the
[...] some [...] are contained in more than
one amplicon.



As detailed in Example 2, expression was
demonstrated by disposing the amplicons as single [...]
on nucleic acid microarrays and then performing two-color
fluorescent hybridisation analysis; [...]
expression is based on a statistical confidence that the
signal is significantly greater than negative biological
control spots. The negative biological control [...]
spotted DNA sequences from a different species. Here,
[...] sequences from E. coli were spotted in duplicate to
give a total of [...].

For each hybridisation (each slide, each colour),
the median value of the signal from all of the spots is
determined. The normalised signal value is the arithmetic
mean of the signal from duplicate spots divided by the
population median.

[...] spots are eliminated if there is more
than a five-fold difference between each one of the
duplicate spots' raw signals.

The median of the signal from the remaining
control spots is calculated, and all subsequent calculations
are done with normalised signals.

Control spots having a signal of greater than
median + 2.4 (the value 2.4 is roughly 12 times the
observed standard deviation of control spot populations)
are eliminated. Spots with such high signals are considered
to be "outliers".

The mean and standard deviation of the modified
control spot populations are calculated.

The mean + 3x the standard deviation (mean +
(3*SD)) is used as the signal threshold qualifier for that
[...] determined for each channel and each [...].

This means that, assuming that the data is
distributed normally, there is a 99% confidence that any
signal exceeding the threshold is significant.
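The control-spot filtering and thresholding steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and argument layout are hypothetical, the standard deviation is taken as the sample standard deviation (the source does not say which estimator was used), and the toy signal values are invented.

```python
from statistics import mean, median, stdev


def signal_threshold(control_pairs, population_median,
                     fold=5.0, outlier_offset=2.4, n_sd=3.0):
    """Compute the mean + 3*SD threshold from duplicate control spots.

    control_pairs: raw signals of duplicate negative control spots, as (a, b).
    population_median: median raw signal of all spots for this slide/colour.
    """
    # 1. Drop duplicate pairs whose raw signals differ by more than five-fold.
    kept = [(a, b) for a, b in control_pairs if max(a, b) <= fold * min(a, b)]
    # 2. Normalise: mean of the duplicates divided by the population median.
    normalised = [((a + b) / 2) / population_median for a, b in kept]
    # 3. Remove outliers lying above the control median + 2.4.
    cutoff = median(normalised) + outlier_offset
    trimmed = [s for s in normalised if s <= cutoff]
    # 4. Threshold = mean + 3 * SD of the remaining control spots
    #    (stdev needs at least two remaining spots).
    return mean(trimmed) + n_sd * stdev(trimmed)


# Hypothetical raw duplicate signals for five control features.
controls = [(100, 120), (90, 110), (80, 100), (100, 700), (2000, 2200)]
threshold = signal_threshold(controls, population_median=500)
```

In this toy data set the pair (100, 700) is discarded by the five-fold rule and the pair (2000, 2200) is discarded as an outlier, so the threshold is computed from the three remaining control features.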
The probes and their expression data are
presented in Table 4, set forth respectively in Example 5.
Example 5 presents the subset of probes that is
significantly expressed in the human lung and thus presents
the subset of probes that was recognized to be useful for
measuring expression of their cognate genes in human lung
tissue.

The sequence of each of the exon probes
identified by SEQ ID NOS.: 12,615 - 25,001 was individually
used as a BLAST (or, for SWISSPROT, BLASTX) query to
identify the most similar sequence in each of dbEST,
SwissProt (BLASTX), and NR divisions of GenBank. Because
the query sequences are themselves derived from genomic
sequence in GenBank, only nongenomic hits from NR were
scored.

The smallest in value of the BLAST (or BLASTX)
expect ("E") scores for each query sequence across the
three database divisions was used as a measure of the
"expression novelty" of the probe's ORF. Table 4 is sorted
in descending order based on this measure, reported as
"Most Similar (top) Hit BLAST E Value". Those sequences for
which no "Hit E Value" is listed are those exons which were
found to have no similar sequences.

As sorted, Table 4 thus lists its respective
probes (by "AMPLICON SEQ ID NO.:" and additionally by the
SEQ ID NO.: of the exon contained within the probe: "EXON
SEQ ID NO.:") from least similar to sequences known to be
expressed (i.e., highest BLAST E value), at the beginning
of the table, to most similar to sequences known to be
expressed (i.e., lowest BLAST E value), at the bottom of
the table.

Table 4 further provides, for each listed probe, the accession number
of the database sequence that yielded the "Most Similar (top) Hit
BLAST E Value", along with the name of the database in which the
database sequence is found ("Top Hit Database Source").

Table 4 further provides SEQ ID NOS. corresponding to the predicted
amino acid sequences where they have been determined for the probe and
exon nucleotide sequences. These are set out as PEPTIDE SEQ ID NOS.:.
The peptide sequences for a given exon are predicted as follows: Since
each chip exon is a consensus sequence drawn from predictions from
various exon finding programs (i.e., Grail, GeneFinder and GenScan),
the multiple initial ORFs are first determined in a uniform way
according to each prediction. In particular, the reading frame for
predicting the first amino acid in the peptide sequence always starts
with the first base of any codon and ends with the last base of a
non-termination codon. Next, for each strand of the exon, initial ORFs
are merged into one or more final ORFs in an exhaustive process based
on the following criteria: 1) the merging ORFs must be overlapping,
and 2) the merging ORFs must be in the same frame.
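
The two merging criteria can be made concrete with a small Python sketch. It is an illustration under stated assumptions, not the procedure actually used: each initial ORF is represented as a half-open (start, end) coordinate pair on a single strand, "same frame" is taken to mean equal start coordinate modulo 3, and "overlapping" is taken to mean sharing at least one base. The function name merge_orfs and the coordinates are hypothetical.

```python
from collections import defaultdict


def merge_orfs(orfs):
    """Merge overlapping, same-frame ORFs on one strand.

    `orfs` is a list of (start, end) half-open intervals. ORFs are
    grouped by frame (start % 3); within each frame, sorting by start
    and sweeping once merges every chain of overlapping intervals, so
    the result is exhaustive.
    """
    by_frame = defaultdict(list)
    for start, end in orfs:
        by_frame[start % 3].append((start, end))

    merged = []
    for frame in sorted(by_frame):
        current = None
        for start, end in sorted(by_frame[frame]):
            if current is not None and start < current[1]:
                # Overlaps the ORF being built: extend it.
                current = (current[0], max(current[1], end))
            else:
                if current is not None:
                    merged.append(current)
                current = (start, end)
        merged.append(current)
    return sorted(merged)


# Hypothetical initial ORFs from three exon-finding programs.
initial = [(0, 30), (21, 60), (10, 40), (57, 90), (100, 130)]
final = merge_orfs(initial)
```

In this example the three frame-0 ORFs chain into one final ORF, while the ORF starting at position 10 overlaps them but stays separate because it is in a different frame.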

The Sequence Listing, which is a superset of all of the data presented
in Table 4, further includes, for each probe, the most similar hit,
with accession number and BLAST E value, from each of the three
queried databases.

Table 4 further lists, for each probe, a portion of the descriptor for
the top hit ("Top Hit Descriptor") as provided in the sequence
database. For those ORFs that are similar in sequence, but
nonidentical to known sequences (e.g., those with BLAST E values
between about 1e-05 and 1e-100), the descriptor reveals the likely
function of the protein encoded by the probe's ORF.

Using BLAST E value cutoffs of 1e-05 (i.e., 1 x 10^-5) and 1e-100
(i.e., 1 x 10^-100) as evidence of similarity to sequences known to be
expressed is of course arbitrary: in Example 2, supra, a BLAST E value
of 1e-30 was used as the boundary when only two classes were to be
defined for analysis (unknown, >1e-30; known, <1e-30) (see also FIG.
8). Furthermore, even when the "Most Similar (Top) Hit BLAST E Value"
is low, e.g., less than about 1e-100 - which is probative evidence
that the query sequence has previously been shown to be expressed -
the top hit is highly unlikely exactly to match the probe sequence.

First, such expression entries typically will not have the intronic
and/or intergenic sequence present within the single exon probes
listed in the Table. Second, even the ORF itself is unlikely in such
cases to be present identically in the databases, since most of the
EST and mRNA clones in existing databases include multiple exons,
without any indication of the location of exon boundaries.
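
The E value cutoffs discussed above can be expressed as a simple binning rule. The following Python sketch is illustrative only: the function names and the three bin labels are hypothetical, and the handling of values that fall exactly on a boundary, or of probes with no hit, is an assumption that the text does not settle.

```python
def classify_two_way(e_value, boundary=1e-30):
    """Two-class split with a single boundary (1e-30 in Example 2).

    Returns "known" for E values below the boundary, otherwise
    "unknown". A probe with no hit (None) is treated as "unknown".
    """
    if e_value is None:
        return "unknown"
    return "known" if e_value < boundary else "unknown"


def classify_three_way(e_value, upper=1e-05, lower=1e-100):
    """Three-bin split using the 1e-05 and 1e-100 cutoffs.

    The labels are hypothetical shorthand for the cases in the text.
    """
    if e_value is None or e_value > upper:
        return "no significant similarity"
    if e_value < lower:
        return "previously shown to be expressed"
    return "similar but nonidentical"


# Hypothetical E values.
examples = [classify_three_way(e) for e in (0.5, 1e-40, 1e-120)]
```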

As noted, the data presented in Table 4 represent a proper subset of
the data present within the attached sequence listing. For each
amplicon probe (SEQ ID NOs.: 1 - 12,614) and probe exon (SEQ ID NOs.:
12,615 - 25,001, respectively), the sequence listing further provides,
through iterated annotation fields <220> and <223>:

(a) the accession number of the BAC from which the sequence was
derived ("MAP TO"), thus providing a link to the chromosomal map
location and other information about the genomic milieu of the probe
sequence;

(b) the most similar sequence provided by BLAST query of the EST
database, with accession number and BLAST E value for the "hit";

(c) the most similar sequence provided by BLAST query of the GenBank
NR database, with accession number and BLAST E value for the "hit";
and

(d) the most similar sequence provided by BLASTX query of the
SWISSPROT database, with accession number and BLAST E value for the
"hit".
Genome-Derived Single Exon Probes Useful for Measuring
Expression of Genes in Human Lung

Table 4 (523 pages) presents expression, homology, and functional
information for the genome-derived single exon probes that are
expressed significantly in human lung.
CLAIMS

1. A spatially-addressable set of single exon nucleic acid probes for
measuring gene expression in a sample derived from human lung
comprising a plurality of single exon nucleic acid probes, said probes
comprising any one of the nucleotide sequences set out in SEQ ID NOs:
1 - 12,614 or a complementary sequence, or a portion of such a
sequence.

2. A spatially-addressable set of single exon nucleic acid probes as
claimed in claim 1 wherein each of said plurality of probes is
separately and addressably amplifiable.

3. A spatially-addressable set of single exon nucleic acid probes as
claimed in claim 1 wherein each of said plurality of probes is
separately and addressably isolatable from said plurality.

4. A spatially-addressable set of single exon nucleic acid probes as
claimed in any of claims 1 to 3 wherein said probes comprise any one
of the nucleotide sequences set out in SEQ ID NOS.: 12,615 - 25,001.

5. A spatially-addressable set of single exon nucleic acid probes as
claimed in any of claims 1 to 4, wherein each of said plurality of
probes is amplifiable using at least one common primer.

6. A spatially-addressable set of single exon nucleic acid probes as
claimed in any of claims 1 to 5 wherein the set comprises between 50 -
20,000 single exon nucleic acid probes.

7. A spatially-addressable set of single exon nucleic acid probes as
claimed in any of claims 1 to 6, wherein the [...] of the single exon
nucleic acid probes is between 200 and 500 bp.

8. A spatially-addressable set of single exon nucleic acid probes as
claimed in any of claims 1 to 7, wherein at least [...] of said single
exon nucleic acid probes lack prokaryotic and bacteriophage vector
sequence.

9. A spatially-addressable set of single exon nucleic acid probes as
claimed in any of claims 1 to 8, wherein at least [...] of said single
exon nucleic acid probes lack homopolymeric stretches of A or T.

10. A spatially-addressable set of single exon nucleic acid probes as
claimed in any of claims 1 to 9, characterized in that said set of
probes is addressably disposed upon a substrate.

11. [...] silicon, crystalline silicon and plastic.

12. A microarray comprising a spatially addressable set of single exon
nucleic acid probes as claimed in any of claims 1-11.
13. A single exon nucleic acid probe for measuring human gene
expression in a sample derived from human lung comprising a nucleotide
sequence as set out in any of SEQ ID NOs: 1 - 12,614 or a
complementary sequence or fragment thereof, wherein said probe
hybridises at high stringency to a nucleic acid molecule expressed in
the [...]

14. A single exon nucleic acid probe [...] comprising [...] SEQ ID
NOS.: 12,615 [...] complementary sequence or a fragment thereof.

15. [...] a nucleic acid [...] of SEQ ID [...] or a fragment thereof
[...] said probe hybridizes at high stringency to a nucleic [...]

16. A single exon nucleic acid probe as claimed in any one of claims
13 to 15, wherein said single exon nucleic acid probe comprises
between [...] contiguous nucleotides of said SEQ ID NO.

[...] nucleic acid probe as claimed in any one of claims 13 to [...],
[...] in length.

[...] nucleic acid probe as claimed in any one of claims 13 [...]
labeled [...]

[...] nucleic acid probe as claimed in any one [...] bacteriophage
vector sequence.

[...] nucleic acid probe as claimed in any one of claims 13 [...]
22. A method of measuring gene expression in a sample derived from
human lung, comprising:

    contacting the microarray of claim 12 with a first collection of
    detectably labeled nucleic acids, said first collection of nucleic
    acids derived from mRNA of human lung; and then

    measuring the label detectably bound to each probe of said
    microarray.

23. A method of identifying exons in a eukaryotic genome, comprising:

    algorithmically predicting at least one exon from genomic sequence
    of said eukaryote; and then

    detecting specific hybridization of detectably labeled nucleic
    acids to a single exon probe,

wherein said detectably labeled nucleic acids are derived from mRNA
from the lung of said eukaryote, said probe is a single exon probe
having a fragment identical in sequence to, or complementary in
sequence to, said predicted exon, said probe is included within a
microarray according to claim 12, and said fragment is selectively
hybridizable at high stringency.

24. A method of assigning exons to a single gene, comprising:

    identifying a plurality of exons from genomic sequence according
    to the method of claim 23; and then

    measuring the expression of each of said exons in a plurality of
    tissues and/or cell types using hybridization to single exon
    microarrays having a probe with said exon,

wherein a common pattern of expression of said exons in said plurality
of tissues and/or cell types indicates that the exons should be
assigned to a single gene.

25. A nucleic acid sequence as set out in any of SEQ ID NOs: 1 -
25,001 which encodes a peptide.

26. A peptide encoded by a sequence as set out in any of SEQ ID NOs:
1 - 25,001.

27. A peptide comprising a sequence as set out in any of SEQ ID NOs:
25,002 - 37,012.
L1.t1 L1 repetitive element
Messenger RNA for anglerfish (Lophius americanus) somatostatin II
ADAM-TS 5 PRECURSOR (A DISINTEGRIN AND METALLOPROTEINASE WITH THROMBOSPONDIN
MOTIFS 5) (ADAMTS-5) (ADAM-TS5) (AGGRECANASE-2) (ADMP-2) (IMPLANTIN)


Box I Observations where certain claims were found unsearchable (Continuation of item 1 of first sheet)

This International Search Report has not been established in respect of certain claims under Article 17(2)(a) for the following reasons:

1. [ ] Claims Nos.:
because they relate to subject matter not required to be searched by this Authority, namely:

2. [X] Claims Nos.:
because they relate to parts of the International Application that do not comply with the prescribed requirements to such
an extent that no meaningful International Search can be carried out, specifically:

see FURTHER INFORMATION sheet PCT/ISA/210

3. [ ] Claims Nos.:
because they are dependent claims and are not drafted in accordance with the second and third sentences of Rule 6.4(a).

Box II Observations where unity of invention is lacking (Continuation of item 2 of first sheet)

This International Searching Authority found multiple inventions in this international application, as follows:

see additional sheet

1. [ ] As all required additional search fees were timely paid by the applicant, this International Search Report covers all
searchable claims.

2. [ ] As all searchable claims could be searched without effort justifying an additional fee, this Authority did not invite payment
of any additional fee.

3. [X] As only some of the required additional search fees were timely paid by the applicant, this International Search Report
covers only those claims for which fees were paid, specifically claims Nos.:

1-27 (partially)

4. [ ] No required additional search fees were timely paid by the applicant. Consequently, this International Search Report is
restricted to the invention first mentioned in the claims; it is covered by claims Nos.:

Remark on Protest
[ ] The additional search fees were accompanied by the applicant's protest.
[X] No protest accompanied the payment of additional search fees.
Continuation of Box I.2

The following statements about the impossibility of performing a
meaningful search according to Art. 17(2) PCT are made for the subject
matter for which a search has been performed and identified as the first
invention in form 206 PCT. If additional fees are paid for the (one or
more) as yet unsearched inventions, similar statements about incomplete
searches could be issued.

Present claims 1-12 and 22-24 relate to an extremely large number of
possible sets of nucleic acid probes comprising SEQ ID NOS:1-3 as well as
microarrays comprising said sets. In fact, the claims contain so many
possible permutations that a lack of clarity and conciseness within the
meaning of Article 6 PCT arises to such an extent as to render a
meaningful search of the claims impossible. Consequently,
the search for the sets of probes comprising SEQ ID NOS:1-3 has been
limited to SEQ ID NOS:1-3 as such.

Claims 1-3, 5, 6, 8-15 and 18-24 relate to portions or fragments of
nucleic acids defined by SEQ ID NOS:1-3. The length or other similar
characterizing features of the portions or fragments is not disclosed,
bringing the total number of possible prior art sequences to
exceptionally high numbers. The shorter the length, the higher the
possibility that an overflow of, in principle unrelated, sequences are
retrieved, making the establishment of a meaningful International Search
Report impossible. For this reason the search has been limited to
portions or fragments of SEQ ID NOS:1-3 having a significant minimum
length and being supported by the description, namely at least 15
contiguous nucleotides (see claim 16).

Claims 15-21 relate to an extremely large number of nucleic acid probes.
The probes are defined solely by their potential to code for peptide SEQ
ID NOS: 25010 and 25011. However, due to the degeneracy of the genetic
code, this peptide is potentially coded by an extremely high number of
nucleic acid sequences. In fact, the claims contain so many potential
nucleic acid sequences that a lack of clarity and conciseness within the
meaning of Article 6 PCT arises to such an extent as to render a
meaningful search over the whole scope of the claims impossible. The
search has therefore been carried out for those parts of the claims which
do appear to be clear and concise, namely the nucleic acid sequences
disclosed in the application and identified as encoding the referred
(poly)peptide in table 4 (SEQ ID NOS: 1-3, 12623, 12624, 25010 and
25011).

Likewise, claim 26, which refers to peptides encoded by SEQ ID NOS:1-3
and SEQ ID NOS: 12623 and 12624, encompasses a high and undefined number
of possible peptides. Besides three possible reading frames deriving from
the encoding nucleic acid strand, as well as three additional reading
frames deriving from the complementary nucleic acid strand, every
possible fragment of these is being covered by the claim. This is due to
the potential presence of stop codons within any of the six possible
reading frames which cannot be established a priori. Thus, claim 26
contains so many potential peptide sequences that a lack of clarity and
conciseness within the meaning of Article 6 PCT arises to such an extent
as to render a meaningful search over the whole scope of the claim
impossible. Consequently, the search has been carried out for those parts
of the claim which do appear to be clear and concise, namely the peptide
disclosed, identified by SEQ ID NOS: 12623 and 12624.
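
The point about six reading frames can be illustrated with a brief Python sketch that translates a nucleotide sequence in all three frames of each strand. It is offered only as an illustration: it assumes the standard genetic code, the function names are hypothetical, and the 12-base input is invented and unrelated to any SEQ ID NO in the application. Stop codons are rendered as "*".

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the complementary strand read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; a trailing partial codon is dropped."""
    return "".join(
        CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Map frame labels (+1..+3, -1..-3) to their translations."""
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames


# Invented 12-base example sequence.
frames = six_frame_translations("ATGGCCTAAGGA")
```

Even for this short input, two of the six frames contain a stop codon, which shows why the set of peptides "encoded by" a sequence cannot be fixed without first choosing a frame and a fragment.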

The applicant's attention is drawn to the fact that claims, or parts of 
claims, relating to inventions in respect of which no international 
search report has been established need not be the subject of an 
international preliminary examination (Rule 66.1(e) PCT). The applicant 
is advised that the EPO policy when acting as an International 
Preliminary Examining Authority is normally not to carry out a 
preliminary examination on matter which has not been searched. This is 
the case irrespective of whether or not the claims are amended following 
receipt of the search report or during any Chapter II procedure. 
This International Searching Authority found multiple (groups of)
inventions in this international application, as follows:

Invention 1: Claims 1-27 (partial)

A nucleic acid probe comprising SEQ ID NOS:1 or 2,
complementary sequences or fragments thereof, in particular
comprising SEQ ID NO:12623, spatially addressable sets of
probes comprising said sequence(s), microarrays comprising
said sets, a method for measuring gene expression, a method
for identifying exons, a method for assigning exons to a
single gene comprising the use of said arrays and peptides
encoded by SEQ ID NOS:1, 2 and 12623 (in particular the one
defined by SEQ ID NO:25010).



Invention 2: Claims 1-27 (partial)

A nucleic acid probe comprising SEQ ID NO:3, complementary
sequences or fragments thereof, in particular comprising SEQ
ID NO:12624, spatially addressable sets of probes comprising
said sequence(s), microarrays comprising said sets, a method
for measuring gene expression, a method for identifying
exons, a method for assigning exons to a single gene
comprising the use of said arrays and peptides encoded by
SEQ ID NO:3 and 12624 (in particular the one defined by SEQ
ID NO:25011).



Inventions 3 to 12614: Claims 1-27 (partial)

A nucleic acid probe comprising SEQ ID NO:n (where n ranges
from 4-12614 according to the invention number above; as
disclosed in table 4), complementary sequences or fragments
thereof, in particular comprising the SEQ ID NO. which is
listed in the column "Exon Seq. Id. no." in the same
row within table 4 that contains Seq. Id. n, spatially
addressable sets of probes comprising said sequence,
microarrays comprising said sets, a method for measuring
gene expression, a method for identifying exons, a method
for assigning exons to a single gene comprising the
use of said arrays and peptides encoded by SEQ ID NO:n, in
particular the one defined by the SEQ ID NO in the column
"ORF Seq. Id. no." of the same row where SEQ ID NO:n is
listed.